Senior Site Reliability Engineer

Posted about 2 hours ago

💎 Seniority level: Senior, 5+ years

📍 Location: United States, Europe, EST, GMT, CEST

🔍 Industry: Software Development

🏢 Company: Dune | 👥 101-250

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

Requirements:
  • Proven expertise in managing and optimizing bare-metal infrastructure and containerized environments.
  • Experience with infrastructure-as-code and orchestration tools.
  • Strong understanding of system performance, debugging, and optimization across diverse environments.
  • Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
  • Solid foundation in computer science fundamentals and system design.
  • Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
  • 5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
  • Experience with distributed systems and managing large-scale, high-availability environments.
  • Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
  • Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
  • Experience with bare-metal infrastructure.
  • Proficiency in scripting or programming languages such as Python, Go, or Bash.
  • Experience with monitoring and observability tools for infrastructure performance.
  • Familiarity with cloud cost management and performance improvement strategies.
  • Strong analytical and troubleshooting skills.
  • Experience working across multiple time zones.
Responsibilities:
  • Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
  • Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
  • Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
  • Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
  • Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

Related Jobs

📍 United States

🧭 Full-Time

💸 128,350 - 192,100 USD per year

🔍 Software Development

🏢 Company: ClickHouse | 👥 101-250 | 💰 Series B, over 2 years ago | Database, Artificial Intelligence (AI), Big Data, Analytics, Software

  • At least 8 years of experience in Site Reliability Engineering or a related field.
  • Previous experience using ClickHouse in production.
  • Coding experience with Go and/or Python.
  • Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
  • Excellent understanding of distributed databases and SQL; ClickHouse experience in particular is a major plus.
  • Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
  • Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.
  • Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse.
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
  • Ensure all the infrastructure components in ClickHouse Cloud (including Dataplane, Control Plane and ClickHouse Core) have monitoring and alerting in place to ensure timely detection and resolution of incidents.
  • Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers.
  • Continuously improve the reliability and performance of our ClickHouse services.
  • Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
  • Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.

AWS, Docker, Python, SQL, Cloud Computing, Kubernetes, Cross-functional Team Leadership, ClickHouse, Go, REST API, Communication Skills, CI/CD, Problem Solving, Linux, DevOps, Terraform, Excellent communication skills, Teamwork, Strong communication skills, Ansible, Debugging

Posted about 3 hours ago

📍 Germany, Spain, Portugal

🏢 Company: Jobgether | 👥 11-50 | 💰 $1,493,585 Seed, about 2 years ago | Internet

  • 5+ years of experience in a Site Reliability Engineer or similar role.
  • 3+ years of experience with AWS services and container orchestration tools.
  • 2+ years of Kubernetes experience.
  • Strong knowledge of observability tools and principles (monitoring, logging, tracing).
  • Hands-on experience with Terraform for infrastructure as code.
  • Proficiency in at least one programming language (e.g., Python, Go, Java).
  • Experience in incident management, postmortem analysis, and risk mitigation.
  • Familiarity with messaging systems like SNS and SQS, and experience with CI/CD tools.
  • Develop and maintain systems that are reliable, scalable, and efficient.
  • Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure optimal system performance.
  • Conduct blameless post-incident reviews, identify root causes, and implement preventive actions.
  • Automate operational tasks and incident responses, and contribute to system performance optimizations.
  • Work with engineering teams to ensure systems are designed for reliability, scalability, and maintainability.
  • Continuously evaluate and improve system performance, capacity, and cost efficiency.
  • Participate in the on-call rotation, providing troubleshooting and resolution support for critical issues.

AWS, Python, Java, Kubernetes, Go, CI/CD, RESTful APIs, Linux, Terraform, Scripting

Posted 1 day ago

📍 United States

🔍 Blockchain

🏢 Company: IO Global

  • 7+ years of experience in SRE, DevOps, or a related role.
  • Understanding of SRE best practices, architectures, and methods.
  • Good knowledge of resiliency patterns and cloud security.
  • Strong programming proficiency in Python, Golang, or JavaScript.
  • Demonstrated experience with AWS and modern cloud architectures.
  • Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and ArgoCD.
  • Hands-on experience with Kubernetes/EKS and GitOps methodologies.
  • Proven track record with monitoring tools such as Prometheus and OpenTelemetry, plus familiarity with the LGTM stack or comparable tools.
  • Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
  • Ability to engage in technical discussions and be part of the decision-making process.
  • Strong problem-solving skills and capability to work on complex systems
  • Experience in working within an Agile environment
  • Experience in working with a distributed team
  • Strong communication and collaboration abilities to work seamlessly across different teams.
  • A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.
  • Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
  • Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
  • Leverage GitOps principles to automate deployments and manage container orchestration.
  • Implement and manage CI/CD pipelines that ensure seamless, high-quality deployments: find and remove bottlenecks, improve performance, and work alongside teams to refine feedback loops and automate toil away.
  • Develop automation tools and scripts to improve operational efficiency.
  • Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
  • Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
  • Collaborate with dev teams to define and implement SLOs/SLIs
  • Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
  • Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
  • Evaluate and adopt new technologies, with a special advantage for candidates with blockchain experience, to keep our systems at the cutting edge.
  • Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
  • Balance effective delivery of goals with a measurably high standard for them. Always apply a layer of polish and due diligence when delivering.

AWS, Docker, Python, Agile, Blockchain, Cloud Computing, JavaScript, Kubernetes, Prometheus, Rust, Communication Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

Posted 1 day ago

📍 United Kingdom

🔍 Software Development

🏢 Company: StarRez | 👥 251-500 | 💰 Private, about 3 years ago | Consulting, SaaS, Property Management, Software

  • 1+ years experience working on a SaaS platform
  • Proven experience (2+ years) in a Platform Engineering, Site Reliability Engineering or Software Engineering role.
  • Proficiency in at least one object-oriented programming language (C# preferred)
  • Production experience operating containerization technologies (Kubernetes).
  • Proficiency with one or more public cloud providers such as Azure, AWS or GCP
  • Proficiency using Infrastructure as Code (IaC) tools such as Terraform (preferred), Ansible, or CloudFormation.
  • Proficiency in scripting and automation using languages like Bash, PowerShell or Python.
  • Experience with monitoring, observability and logging tools such as Datadog, Prometheus, Grafana, or similar.
  • Proven track record of maintaining highly-available and performant production environments.
  • Ability to identify and implement effective mitigation strategies and operational playbooks.
  • Provide technical leadership and mentoring within the team through knowledge sharing sessions, pair programming, code reviews and solution design
  • Identify and implement solutions to improve platform reliability, including the creation of mitigation strategies and operational playbooks.
  • Implement and maintain monitoring/alerting/logging systems to identify and respond to incidents
  • Conduct/participate in Root Cause Analyses (RCAs) and blameless post-mortems
  • Participate in on-call rotations to ensure system reliability and rapid incident response.
  • Ensure scalability and efficiency of cloud infrastructure and systems to handle traffic and data growth
  • Conduct performance tests to identify and remediate bottlenecks
  • Develop and maintain platform solutions, automate infrastructure provisioning, configuration, and management tasks using Infrastructure as Code.
  • Monitor, review and tune databases to ensure high availability and performance
  • Collaborate with product engineering teams to design/build fit-for-purpose and observable software
  • Contribute and collaborate across teams to define Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs) as required

AWS, Docker, Python, SQL, Bash, GCP, Kubernetes, C#, Azure, Grafana, Prometheus, CI/CD, DevOps, Terraform, Ansible, Software Engineering, SaaS

Posted 3 days ago

📍 United States, Canada, Latin America

🧭 Full-Time

💸 160,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Superhuman | 👥 51-200 | 💰 $75,000,000 Series C, over 3 years ago | 🫂 Last layoff almost 3 years ago | Software Development

  • 6+ years of experience in SRE, DevOps, or systems engineering roles.
  • Proven experience managing high-availability, mission-critical systems.
  • Strong proficiency with cloud platforms (GCP, AWS, or Azure).
  • Hands-on experience with containers and orchestration tools (Docker, Kubernetes).
  • Expertise in monitoring, logging, and alerting tools (e.g., Metabase, Datadog, Prometheus, Grafana).
  • Proficiency in scripting/programming languages (Python, Go, Bash, etc.).
  • Knowledge of database management systems (SQL/NoSQL).
  • Strong knowledge of networking, security, and distributed systems.
  • Experience with Infrastructure as Code (Terraform, Ansible, Chef, or Puppet).
  • Familiarity with version control systems (Git) and CI/CD pipelines (Jenkins, GitLab CI, etc.).
  • Strong communication skills and ability to work collaboratively across teams.
  • Problem-solving mindset with a focus on root cause analysis.
  • Proactive, self-driven, and able to handle high-pressure environments.
  • Collaborate with software engineers to design scalable, fault-tolerant systems and services.
  • Proactively monitor service health, availability, and performance.
  • Respond to and troubleshoot production issues.
  • Perform capacity planning and scaling activities.
  • Automate repetitive tasks to enhance efficiency.
  • Design and implement disaster recovery plans and high availability strategies.
  • Collaborate with our security team to ensure infrastructure adheres to best practices and compliance requirements.
  • Build, maintain, and enhance CI/CD pipelines.
  • Manage and automate infrastructure provisioning and configuration.
  • Work closely with development teams to ensure best practices in deployment and release processes.
  • Champion DevOps culture by mentoring and guiding other engineers in the use of tools and best practices.

AWS, Docker, Python, SQL, Bash, Cloud Computing, GCP, Git, Jenkins, Kubernetes, Azure, Go, Grafana, Prometheus, NoSQL, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Networking, Ansible, Scripting

Posted 6 days ago

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy | 👥 5001-10000 | 💰 $800,000,000 Post-IPO Equity, about 3 years ago | 🫂 Last layoff over 1 year ago | Web Hosting, Domain Registrar, Web Development, Online Portals

  • Experience with REST APIs
  • Experience with testing code
  • Experience with Docker and other container-related technologies
  • Experience with Python or similar languages
  • Experience with HashiCorp Vault or other similar tooling
  • Engage with engineers and partners to solve problems
  • Lead by example with high coding standards
  • Improve the observability of production services
  • Share expertise by training and guiding other engineers

Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

Posted 13 days ago

📍 United States, European time zones

🧭 Full-Time

🔍 Software Development

🏢 Company: Invert | 👥 11-50 | 💰 $20,149,993 Seed, 8 months ago | Data Management, SaaS, Application Performance Management

NOT STATED
  • Design, build, and maintain scalable and secure cloud infrastructure as code
  • Develop and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure software reliability
  • Enable cost transparency and optimize infrastructure spending
  • Reduce cognitive load for product engineers by creating streamlined, efficient development workflows
  • Build and maintain robust CI/CD pipelines that accelerate time from code to customer
  • Create and maintain intuitive, comprehensive observability solutions for end-to-end system monitoring
  • Lead and continuously improve our Incident Management process
  • Participate in the on-call rotation, serving as a First Responder to quickly address and resolve system issues
  • Develop and maintain incident response playbooks and post-mortem practices

AWS, Docker, CI/CD, Linux, Terraform

Posted 22 days ago

📍 Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Sanity | 👥 51-200 | 💰 Corporate, over 2 years ago | Software Development

  • Proven experience with SRE/DevOps tools, processes, and culture.
  • Proficient in programming languages like Python, Go, and TypeScript.
  • 5+ years of experience participating in an SRE on-call rotation.
  • Analytical mindset for designing, diagnosing, and optimizing infrastructure.
  • Skilled in managing scalable, highly available, cloud-based applications.
  • Hands-on experience with Kubernetes for orchestrating, scaling, and managing containerized applications in the cloud.
  • Strong database management skills, particularly with PostgreSQL.
  • Experience with infrastructure as code, using tools like Terraform.
  • Proficient in building and maintaining CI/CD pipelines.
  • Familiarity with observability tools like Prometheus and similar stacks.
  • Calm and clear-headed in incident and outage situations, with a thoughtful communication style for high-pressure environments.
  • Open-minded yet discerning when it comes to exploring new technologies.
  • Plan and implement a global platform for delivering our software as a service.
  • Diagnose and troubleshoot complex distributed systems.
  • Ensure observability and analyze the behavior of our stack.
  • Handle orchestration, deployment, monitoring, and automation.
  • Participate in our on-call rotation.

PostgreSQL, Python, Cloud Computing, Elasticsearch, Kubernetes, TypeScript, Go, Prometheus, CI/CD, Linux, DevOps, Terraform, Microservices

Posted 22 days ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: Assured | Cloud Data Services, B2B, Cloud Security, Cyber Security

  • Experience in a start-up environment
  • Design and maintain highly available database solutions, ideally PostgreSQL
  • Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
  • Strong engineering background
  • Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)
  • Provision infrastructure and tooling
  • Create automated tooling to maintain the platform
  • Build methods for monitoring and scaling services
  • Implement security compliance strategies
  • Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

Posted 26 days ago

📍 USA, CAN, MEX

🔍 Transportation technology

🏢 Company: Fleetio

  • 5+ years of AWS Experience.
  • 3+ years Kubernetes Experience.
  • Ruby on Rails experience.
  • Expert at profiling and benchmarking source code.
  • Effective at code review, and identifying potential performance problems before they reach production.
  • Experience with Datadog or other APM tools.
  • Excellent written and verbal communication skills.
  • Manage cloud infrastructure using Infrastructure as Code.
  • Manage and scale a Ruby on Rails stack.
  • Implement monitoring tools to improve observability.
  • Perform code review of new features to ensure they meet performance requirements.
  • Debug production issues across all levels of the stack.
  • Plan for the growth of, optimize, and automate Fleetio’s infrastructure.

AWS, Cloud Computing, Kubernetes, Ruby on Rails, CI/CD, Terraform, Microservices

Posted about 1 month ago