Senior Site Reliability Engineer

Posted 7 months ago

💎 Seniority level: Senior, 5+ years

📍 Location: North America

🔍 Industry: Incident Management Platform

🏢 Company: Rootly | 👥 11-50 | 💰 $12,000,000 Series A over 1 year ago | Developer Tools, Developer Platform, Productivity Tools, SaaS, Information Technology, Software

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: AWS, Backend Development, Software Development, Cloud Computing, Git, Kubernetes, Amazon Web Services, CI/CD

Requirements:
  • You have 5+ years of experience in an SRE or Infrastructure Engineering role.
  • You have 5+ years of experience writing software as a SWE or in a software-heavy SRE role.
  • You have strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices.
  • You’ve supported web or RPC services at significant scale.
  • You have experience solving infrastructure problems by writing software.
  • You have a big-picture perspective on systems and tools.
  • You can collaborate with other Engineering teams to understand their systems and help to improve them.
Responsibilities:
  • Participate in an on-call rotation to support critical Rootly services, and in some cases be on call with software teams.
  • Participate in the definition and management of SLOs and error budgets for the Engineering teams that own services in production.
  • Build tools to support our processes.
  • Embed with feature delivery software teams to build and enhance observability, reliability, and availability of those services.
  • Work with other Engineering teams to understand their systems and challenges at the code level, and identify changes to Rootly infrastructure that improve the services they own (contributing code where possible).

Related Jobs

📍 USA

🧭 Full-Time

💸 140,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Juniper Square

  • 5+ years of experience in SRE, DevOps, or Infrastructure Engineering with a proven track record of ownership and initiative.
  • Strong experience with Kubernetes, Helm, and CNIs, including networking and security.
  • Proficiency in AWS services such as RDS, Aurora, IAM, VPC, EKS, and EC2.
  • Experience in PostgreSQL administration, including performance tuning and high availability in RDS/Aurora.
  • Hands-on experience with GitHub Actions and ArgoCD for secure and scalable CI/CD automation.
  • Strong background in Infrastructure as Code (IaC) with Crossplane and Terraform.
  • Deep understanding of observability and monitoring with Datadog.
  • Experience with Kyverno for Kubernetes policy-based security enforcement.
  • Proficiency in Python and Bash scripting for automation and system management.
  • Strong understanding of CI/CD security best practices and ability to implement controls for securing deployments.
  • Own reliability and scalability initiatives—identify, prioritize, and implement solutions before issues escalate.
  • Participate in an on-call rotation, responding to incidents, performing root cause analysis, and driving long-term fixes.
  • Design, deploy, and manage Kubernetes clusters using Helm charts, Cilium, and Karpenter to optimize performance and cost.
  • Architect and maintain AWS infrastructure with a focus on RDS/Aurora PostgreSQL, networking, and scaling best practices.
  • Implement GitHub Actions CI/CD pipelines, integrating security best practices and automation.
  • Define and enforce policy-based security for Kubernetes using Kyverno.
  • Automate infrastructure provisioning with Crossplane and Terraform to ensure consistency and scalability.
  • Enhance observability and monitoring using Datadog to proactively detect and resolve issues.
  • Improve security and reliability by identifying risks in CI/CD, cloud environments, and Kubernetes, then implementing necessary safeguards.
  • Lead post-incident reviews, drive lessons learned into long-term improvements, and document best practices in Confluence.

AWS, PostgreSQL, Python, Bash, Kubernetes, CI/CD, DevOps, Terraform

Posted about 4 hours ago

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

  • 5+ years of Ruby/Rails experience
  • 3+ years of AWS experience
  • Kubernetes experience
  • Experience with profiling and benchmarking source code
  • Effective at code review and at identifying potential performance problems before they reach production
  • Experience with Datadog or other APM tools
  • Excellent written and verbal communication skills
  • Proactively identify, triage, and resolve performance issues
  • Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
  • Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
  • Optimize performance through instance configuration and monitoring
  • Collaborate with other SREs to proactively identify and address performance bottlenecks
  • Lead database capacity planning and upgrade initiatives
  • Manage the database-specific components of disaster recovery planning and execution
  • Oversee backup systems and pre-production databases
  • Create and maintain infrastructure and operations documentation
  • Participate in the on-call rotation

AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

Posted 7 days ago

📍 USA

🧭 Full-Time

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase Careers Page | 👥 1000-5000

  • 5+ years of experience building, iterating upon, and maintaining corporate IAM systems
  • 5+ years of experience with operational procedures and application development
  • Deep domain knowledge of prominent cloud identity providers: Okta, Duo, Google Workspace, Azure AD, Ping, etc.
  • Demonstrated success developing and implementing tooling that solves problems related to: identity lifecycle and provisioning, SSO, MFA, ABAC, RBAC, directory services, zero trust networking, PAM, PIM, and secrets management
  • Experience configuring and implementing modern open source tooling such as: Terraform, Ansible, Kubernetes, Docker
  • Fluency in a modern programming language (Golang, Python, Ruby, Java, C#, etc.)
  • Strong experience using and managing AWS, GCP, Azure, or another cloud environment with IaC
  • Strong understanding of CI/CD workflows, automation frameworks, and best practices
  • Clear communication—demonstrate ability to explain technical concepts simply
  • Self starter—possess a continuous learning mindset
  • Demonstrate critical thinking under pressure
  • Engage in a dynamic role that combines traditional operations responsibilities and active contributions to the development and deployment of cloud-native applications, fostering a DevOps culture that emphasizes collaboration and automation
  • Partner across Coinbase to design, implement, and maintain performant, reliable, and secure system architectures
  • Provide corporate IAM and DevOps tooling subject matter expertise to adjacent IT, Security, and Engineering teams
  • Implement automation tooling and scripts to eliminate manual, repetitive tasks and reduce inefficiencies in system operations
  • Create comprehensive documentation and runbooks that detail system configurations, operational procedures, and troubleshooting steps across the system lifecycle
  • Build and maintain CI/CD pipelines for integrating changes and deploying to production in progressively tested environments
  • Deliver configurations and maintain state using configuration management tools
  • Facilitate incident response, and conduct root cause analysis and blameless retrospectives
  • Define metrics and bolster monitoring/observability across corporate IAM systems
  • Participate in regular on-call rotation to ensure 24x7 uptime for critical systems

AWS, Docker, Python, Cloud Computing, Kubernetes, LDAP, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

Posted 9 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros | 👥 11-50 | 💰 $17,000,000 about 2 years ago | Cryptocurrency

  • An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
  • Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
  • A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
  • Knowledge and experience in managing configuration at scale.
  • Experience with CI/CD pipelines and version control best practices.
  • Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
  • Strong knowledge of cloud security and IAM policies is required.
  • SIEM and threat management experience.
  • Must know how to secure Mac and Linux endpoints.
  • Python and Bash experience is a must.
  • Participate in the on-call roster to support our trading operations.
  • Maintain and improve our global infrastructure with high performance and reliability requirements.
  • Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
  • Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
  • Development of internal tools and automation to accomplish the team’s goals.
  • Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
  • Active participation in various trading and infrastructure projects.
  • Work closely with developers, traders and other staff to accomplish our firm’s goals.

AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

Posted 10 days ago

📍 United States

🧭 Full-Time

💸 120,000 - 150,000 USD per year

🔍 Software Development

🏢 Company: Echo360 Inc

  • 5+ years of experience as a Site Reliability Engineer or similar role.
  • Strong understanding of AWS cloud services, including DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, EKS, ECS and EC2.
  • Experience with infrastructure automation tools like Ansible, Terraform, or CloudFormation.
  • Experience with monitoring and alerting tools like CloudWatch, Datadog, Prometheus, Grafana, Kibana, and PagerDuty.
  • Experience with GitHub Actions, CI/CD pipelines, and deployment strategies.
  • Strong problem-solving and analytical skills.
  • Excellent communication and collaboration skills.
  • Ability to work independently and take ownership of complex tasks.
  • Passion for technology and a desire to learn and grow.
  • Experience with Jenkins, PostgreSQL, and MongoDB.
  • Experience with cloud cost optimization, security best practices and tools.
  • Experience working in a fast-paced, agile environment.
  • Experience with Rancher, Cattleprod, and TeamCity is a plus.
  • Ensure service reliability and SLO/SLA adherence in production, preventing incidents by proactively conducting failure testing.
  • Implement automated monitoring and alerting systems for early detection of potential problems.
  • Collaborate with development teams to perform deployments and rollbacks with minimal disruption.
  • Optimize the performance and scalability of our AWS infrastructure, including RDS, DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, SES, EKS, ECS, and EC2.
  • Automate infrastructure provisioning and deployment processes using Terraform, CI/CD pipelines, and configuration management tools.
  • Proactively identify and address potential security vulnerabilities, and maintain compliance, IAM best practices, and secrets management.
  • Participate in incident response and post-mortem analysis activities to identify root causes and prevent future occurrences.
  • Help onboard and mentor junior team members, sharing your knowledge and expertise.
  • Stay up to date on the latest cloud technologies and best practices for SRE.
  • Participate in a well-structured on-call rotation with other Site Reliability Engineers.
  • Explore new technologies and innovative solutions to improve service quality and speed to market.
  • Participate in technical discussions and deep dives with the other engineering and product teams.

AWS, PostgreSQL, DynamoDB, Jenkins, Kafka, Kibana, MongoDB, MySQL, Grafana, Prometheus, CI/CD, Agile methodologies, Linux, DevOps, Terraform, Microservices, Ansible

Posted 14 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune | 👥 101-250

  • Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
  • Experience with infrastructure-as-code and orchestration tools.
  • Strong understanding of system performance, debugging, and optimization across diverse environments.
  • Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
  • Solid foundation in computer science fundamentals and system design.
  • Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
  • 5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
  • Experience with distributed systems and managing large-scale, high-availability environments.
  • Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
  • Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
  • Experience with bare-metal infrastructure.
  • Proficiency in scripting or programming languages such as Python, Go, or Bash.
  • Experience with monitoring and observability tools for infrastructure performance.
  • Familiarity with cloud cost management and performance improvement strategies.
  • Strong analytical and troubleshooting skills.
  • Experience working across multiple time zones.
  • Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
  • Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
  • Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
  • Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
  • Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

Posted 21 days ago

📍 United States

🧭 Full-Time

🔍 Blockchain

🏢 Company: IO Global

  • 7+ years of experience in SRE, DevOps, or a related role.
  • Understanding of SRE best practices, architectures, and methods.
  • Good knowledge of resiliency patterns and cloud security.
  • Strong programming proficiency in Python, Golang, or JavaScript.
  • Rust experience is advantageous.
  • Demonstrated experience with AWS and modern cloud architectures.
  • Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and ArgoCD.
  • Hands-on experience with Kubernetes/EKS and GitOps methodologies.
  • Proven track record with monitoring tools such as Prometheus and OpenTelemetry, as well as familiarity with the LGTM stack or comparable tools.
  • Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
  • Ability to engage in technical discussions and be part of the decision-making process.
  • Strong problem-solving skills and capability to work on complex systems.
  • Experience working within an Agile environment.
  • Experience working with a distributed team.
  • Strong communication and collaboration abilities to work seamlessly across different teams.
  • A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.
  • Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
  • Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
  • Leverage GitOps principles to automate deployments and manage container orchestration.
  • Implement and manage CI/CD pipelines that ensure seamless, high-quality deployments: find and remove bottlenecks, improve performance, and work alongside teams to refine feedback loops and automate toil away.
  • Develop automation tools and scripts to improve operational efficiency.
  • Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
  • Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
  • Collaborate with dev teams to define and implement SLOs/SLIs.
  • Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
  • Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
  • Evaluate and adopt new technologies, with a special advantage for candidates with blockchain experience, to keep our systems at the cutting edge.
  • Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
  • Balance effective delivery of goals against a measurably high standard for the results. Always apply a layer of polish and due diligence when delivering.

AWS, Docker, Python, Agile, Blockchain, Cloud Computing, JavaScript, Kubernetes, Prometheus, Rust, Communication Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

Posted 22 days ago

📍 USA

🧭 Full-Time

💸 160,000 - 220,000 USD per year

🔍 Healthcare

🏢 Company: Clover Health | 👥 501-1000 | 💰 $300,000,000 Post-IPO Equity over 3 years ago | 🫂 Last layoff almost 2 years ago | Medical, Health Insurance, Hospital, Health Care

  • 5+ years of programming experience and proficiency in at least one of the following: Python, Go, or shell scripting.
  • In-depth knowledge of containerization and orchestration technologies such as Docker, containerd, and Kubernetes, along with experience with CNCF-based technologies like Helm, gRPC, and Prometheus.
  • Experience with public cloud platforms such as GCP, Azure, or AWS.
  • Knowledgeable in networking fundamentals, including TCP/IP, UDP, firewalls, routing, DNS, and load balancing.
  • Experience with Linux system administration and a solid understanding of Linux design principles.
  • Understanding of key SRE concepts, such as monitoring, performance tuning, and automation.
  • Ability to work autonomously with limited guidance, proactively identifying and solving problems.
  • Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams and adapt to new challenges and evolving technologies.
  • Build systems for declarative application and infrastructure lifecycle management, including continuous deployment, continuous integration, Kubernetes cluster management, and service/workload inventory.
  • Prioritize and troubleshoot infrastructure issues, minimizing downtime and responding to alerts efficiently.
  • Contribute to setting the direction of the Site Reliability Engineering (SRE) team, ensuring goals align with Counterpart Health’s company-wide objectives.
  • Foster a collaborative, high-performance culture that promotes motivation, innovation, and cross-disciplinary teamwork.
  • Streamline and automate infrastructure processes, including delivery pipelines and database changes.

AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Go, gRPC, Prometheus, CI/CD, Linux, Terraform, Networking, Ansible

Posted about 1 month ago

📍 United Kingdom, Canada

🔍 Software Development

🏢 Company: GoDaddy | 👥 5001-10000 | 💰 $800,000,000 Post-IPO Equity over 3 years ago | 🫂 Last layoff over 1 year ago | Web Hosting, Domain Registrar, Web Development, Online Portals

  • A track record of delivering capabilities that build customer value and business impact.
  • Knowledge of principles for building performant and quality REST APIs.
  • Experience with testing code, care and feeding of both on-premises and cloud compute systems, Docker and other container-related technologies, Python or similar languages, and HashiCorp Vault or similar tooling.
  • Engage with engineers and partners across the organization to solve problems with broad impact, stay ahead of the curve with new technologies, and advocate for modern and effective tech stacks.
  • Lead by example with a high standard for coding practices, including practical coding standards, modern software development approaches, test automation, and a strong focus on security.
  • Improve the observability of our production services, allowing the team to quickly highlight gaps, resolve issues, and understand the performance of our systems.
  • Share your expertise by training and guiding other engineers, encouraging a collaborative and nurturing environment for learning.

Backend Development, Docker, Python, Cloud Computing, Kubernetes, Amazon Web Services, REST API, CI/CD, Linux, Ansible

Posted about 1 month ago
🔥 Senior Site Reliability Engineer
Posted about 2 months ago

📍 United States, Canada

🧭 Full-Time

💸 100,000 - 120,000 USD per year

🔍 Software Development

🏢 Company: AssuredCloud Data Services, B2B, Cloud Security, Cyber Security

  • Experience in a start-up environment
  • Design and maintain highly available database solutions, ideally PostgreSQL
  • Experience with compliance and security regulations (SOC 2, HIPAA, ISO 27001)
  • Strong engineering background
  • Knowledge of Node.js, Python, Docker, PostgreSQL, GraphQL (not required)
  • Provision infrastructure and tooling
  • Create automated tooling to maintain the platform
  • Build methods for monitoring and scaling services
  • Implement security compliance strategies
  • Lead and mentor engineering team

AWS, Docker, Node.js, PostgreSQL, Python, Terraform, Compliance

Posted about 2 months ago