Site Reliability Engineer

Posted 1 day ago

📍 Location: United States

💸 Salary: 90,000 - 145,000 USD per year

🏢 Company: Bitwarden | 👥 101-250 | 💰 $100,000,000 Series B over 2 years ago | Privacy, Cyber Security, Enterprise Software, Identity Management, Software

🗣️ Languages: English

🪄 Skills: Python, Cloud Computing, Git, Kubernetes, C#, Go, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible

Requirements:
  • Experience with multi-region deployments in public cloud environments
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.)
  • Fluency in at least one programming language, such as C#, Python, or Go
  • Working knowledge of cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi)
  • Proficiency using source control such as Git
  • Ability to maintain discretion, handle sensitive information, and improve security best practices
  • Technocrat at heart, staying current with trends and new technologies
  • Collaborative and adaptable mindset
  • Openness and authenticity combined with excellent communication skills
  • Excitement and enthusiasm for open source and for better internet security
  • Excellent problem-solving skills – you might not know all the answers, but you know how to find and communicate the possible solutions
Responsibilities:
  • Take ownership of the Bitwarden cloud infrastructure, with an emphasis on quality that translates directly to user delight
  • Evaluate current infrastructure and, on a regular basis, make recommendations for reliability, security, availability, scalability and cost management
  • Implement site reliability tools, monitoring, early warning and alert systems, and observability across Bitwarden cloud environments
  • Respond to infrastructure-based outages; participate in and contribute to the ongoing strategy for 24x7 support (there is an on-call rotation with a weekend shift every 5-6 weeks)
  • Participate actively in code reviews, learning and spreading technical knowledge
  • Contribute to and mature incident management/escalation processes
  • Collaborate with cross-functional teams to refine priorities and deliverables
  • Engage with product owners on an ongoing basis to align SLIs/SLOs/SLAs
  • Evaluate and identify opportunities for new initiatives to support organizational needs

Related Jobs

📍 United States

💸 135,000 - 170,000 USD per year

🔍 Fintech

🏢 Company: Kunai | 👥 51-100 | Consulting, Financial Services, Information Technology, FinTech, Software

  • 6+ years in SRE, DevOps, or infrastructure roles, ideally supporting distributed systems or microservices
  • 3+ years in AWS or other public cloud providers
  • Hands-on experience with SRE tooling: observability, alerting, incident response (e.g., New Relic, PagerDuty, Splunk, OpenTelemetry)
  • Proficiency in at least one programming language such as Go, Java, or Python
  • Experience with automation testing or performance testing tools like Selenium, Postman, Cucumber, JMeter, or similar
  • Strong understanding of CI/CD principles and experience with build and deployment automation
  • Build and maintain tools that improve the reliability, performance, and availability of platform services
  • Implement observability practices and help integrate SRE tooling such as New Relic, OpenTelemetry, and Splunk
  • Collaborate with development and infrastructure teams to resolve incidents, improve monitoring, and optimize system performance
  • Help automate and streamline deployment, recovery, and testing processes
  • Contribute to shared engineering patterns and support adoption across multiple domain teams

🪄 Skills: AWS, Python, Java, JMeter, Kubernetes, Go, Selenium, CI/CD, Linux, DevOps, Microservices

Posted 1 day ago

📍 United States

🧭 Full-Time

💸 240,000 - 400,000 USD per year

🔍 Software Development

🏢 Company: Cribl | 👥 251-500 | 💰 $150,000,000 Series D about 3 years ago | Real Time, Big Data, Information Technology, Software

  • Extensive experience with enterprise scale continuous delivery environments
  • 10+ years of experience in a DevOps or SRE role
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with IaC tools like Terraform (preferred) or similar
  • Experience with sustainable incident response in a blameless environment
  • Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
  • Deep understanding of SRE practices, such as SLOs, Error Budgets, PRRs, Problem Management
  • Comfortable with a high level of autonomy and working with a distributed team
  • Chart the future of Cribl’s observability and reliability systems and practices
  • Conceptualize and direct the evolution of our reliability metrics, programs, and processes based on the state of the art and industry best practices
  • Engage with Product and Engineering teams to improve service delivery and reliability across the entire software lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Uncover risks and seek out the sources of errors and instability in our production systems
  • Advocate for engineering-wide improvements in reliability and observability, and promote antifragility
  • Identify and drive down toil with creative innovation and automation
  • Participate in on-call

🪄 Skills: AWS, Docker, Node.js, SQL, Cloud Computing, Git, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, CI/CD, Agile methodologies, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Scripting

Posted 2 days ago

📍 France, Germany, Spain, United Kingdom, United States, Canada

🧭 Full-Time

🔍 Software Development

  • DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
  • Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
  • Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
  • Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
  • Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
  • Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
  • Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.
  • Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
  • Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
  • Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
  • Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
  • Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
  • Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
  • Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, ElasticSearch, GCP, Kubernetes, Algorithms, Azure, Data Structures, Go, Grafana, Prometheus, Collaboration, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Troubleshooting, Ansible, Scripting

Posted 5 days ago

📍 United States of America

💸 150,000 - 160,000 USD per year

🏢 Company: external-northamerica

  • Bachelor’s degree in Computer Science, Engineering, or related field
  • A minimum of 10 years of experience, including at least 5 years in the SRE field, with a proven track record of progressively increasing responsibilities
  • Demonstrated ability to work with cross-functional Development, QE and Operations teams
  • Strong understanding of and experience with automation tools and programming/scripting languages (e.g., PowerShell, Python, Bash) to deliver improvements at both small and large scale.
  • Strong understanding of observability tools (e.g., Dynatrace, Datadog, New Relic) and best practices, to implement effective monitoring of SLIs/SLOs/SLAs.
  • Strong experience and understanding of software engineering, Infrastructure as Code (Ansible or Terraform) and build/deployment pipelines.
  • Strong troubleshooting skills coupled with making data-driven decisions during incidents, to improve time to detect and resolve issues.
  • Strong understanding of cloud computing platforms (Azure or Google Cloud) and cloud-native setups (AKS, serverless, etc.).
  • Contribute significantly to the reliability, scalability and availability of Bright Horizons' digital infrastructure by enforcing best practices of redundancy and resiliency across applications and infrastructure.
  • Implement robust infrastructure, application and digital-experience monitoring in our enterprise-wide APM tool Dynatrace. Proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.
  • Drive troubleshooting of critical incidents through developing a deep and broad understanding of our enterprise architecture across all 7 OSI layers.
  • Drive the development and implementation of automation solutions to streamline processes, reduce manual interventions, and enhance the overall efficiency of the Product, Engineering and SRE teams.
  • Besides owning Observability tools, create a roadmap to expand and consolidate. This should provide a 360-degree view of cross-functional areas like SRE, DevOps, Application Support, Monitoring, Incident Management, Infrastructure and Enterprise Architecture.
  • Collaborate with the above cross-functional teams to drive a unified approach to site reliability that optimizes their work and improves time-to-market for all respective objectives.
  • Work closely with Infrastructure and Architecture teams to design and implement roadmaps for scaling server and serverless architecture using containers as well as IaC tools like Ansible and Terraform.

🪄 Skills: AWS, Python, Bash, Cloud Computing, Kubernetes, Azure, CI/CD, DevOps, Terraform, Ansible, Scripting, Software Engineering

Posted 7 days ago

📍 USA

🧭 Full-Time

🔍 Software Development

🏢 Company: Close

  • 5+ years of experience building modern infrastructure systems.
  • Familiarity with AWS, Terraform, Kubernetes, Ansible, MongoDB, PostgreSQL, Elasticsearch
  • Strong grasp of common networking and data transfer protocols such as DNS, HTTP, TCP
  • Fully automating our databases' lifecycles with Argo Workflows
  • Eliminating all static credentials, wherever they may be
  • Reducing downtime and disruption due to maintenance or disaster to new lows
  • Help us improve our multi-region disaster recovery system.

🪄 Skills: AWS, PostgreSQL, Bash, ElasticSearch, Git, Kubernetes, MongoDB, Clickhouse, Grafana, CI/CD, Linux, DevOps, Terraform, Networking, Ansible, Scripting

Posted 8 days ago

📍 USA

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase Careers Page | 👥 1000-5000

  • Proven experience as a Site Reliability Engineer (SRE) or similar role.
  • Strong understanding of AI technologies and platforms.
  • Experience with deploying and managing applications in a cloud environment (AWS/GCP).
  • Solid backend development experience with programming languages such as Python, Java, or Go.
  • Strong proficiency in managing and configuring public cloud services (AWS/GCP) for scalability and reliability.
  • Experience with automation tools and scripting (e.g., Ansible, Terraform, Bash, Python).
  • Excellent troubleshooting and problem-solving skills.
  • Strong communication and collaboration skills.
  • Strong security and compliance understanding.
  • Experience working in a highly regulated environment
  • Experience in a fast-paced, high-growth company
  • Deploy, configure, and manage AI-powered employee productivity tools and AI solutions built in-house
  • Ensure high availability, reliability, and optimal performance of AI platforms and services. Implement monitoring, alerting, and incident response procedures
  • Design and implement scalable infrastructure to support the growing demands of AI tools and user base. Optimize resource utilization and manage capacity planning
  • Develop and maintain automation scripts and tools to streamline deployment, monitoring, and maintenance tasks. Contribute to the experimental sandbox environments for testing new AI solutions
  • Collaborate with cross-functional teams (Machine-Learning, HR, Security, Data Science, Developer Experience) to support the development and integration of AI solutions. Provide technical support and troubleshooting for AI-related issues
  • Adhere to security and privacy policies while deploying and managing AI tools. Ensure compliance with regulatory requirements
  • Implement comprehensive monitoring and metrics to track the performance and health of AI systems. Analyze data to identify areas for improvement and optimization
  • Participate in incident response and troubleshooting for AI-related outages or performance issues. Develop and maintain incident response plans
  • Contribute to backend development tasks to support the integration and functionality of AI tools
  • Deploy and manage AI solutions on public cloud platforms (AWS/GCP), leveraging cloud-native services and best practices
  • Excellent communication skills and experience presenting technical information to non-technical audiences, including senior leadership

🪄 Skills: AWS, Backend Development, Docker, Python, Bash, Cloud Computing, GCP, Java, Kubernetes, Go, CI/CD, RESTful APIs, Terraform, Ansible, Scripting

Posted 10 days ago

📍 United States

🧭 Full-Time

💸 161,000 - 180,000 USD per year

🔍 Adult entertainment

🏢 Company: Multi Media LLC

  • STEM degree and/or relevant experience as a Site Reliability Engineer, DevOps Engineer, or SWE
  • Proficiency in Python or Golang. Experience with another compiled or high-level language (C, C++, Java, Rust, etc.) is also accepted
  • Experience running Web applications at scale
  • Experience with Web application concepts and frameworks: ORM, MVC architecture, Django, Flask, Laravel, etc
  • Proficiency with Linux administration, Bash shell, and strong knowledge of Linux internals (e.g., filesystems, system calls)
  • Strong networking knowledge (e.g., routing, switching, TCP stack) for both metal and cloud (VPC, Security Groups) environments
  • Experience in database administration and configuration
  • Experience with DevOps tools such as Terraform, Ansible, Docker, and Kubernetes
  • Willingness to participate in on-call rotation and respond to monitoring and alerting of core website functions as needed
  • Analyze system performance using APM and distributed telemetry data to identify sources of instability
  • Improve scalability, reliability, and performance through software enhancements and patching
  • Develop tools and automation to streamline the DevOps pipeline
  • Design and manage infrastructure in both data center metal environments and in the public cloud
  • Conduct predictive failure analysis and disaster planning
  • Administer and configure databases and key-value stores with a focus on uptime and performance
  • Analyze complex systems to identify operational surprises and minimize downtime
  • Participate in incident response and produce postmortem reports
  • Collaborate with other engineering teams

🪄 Skills: AWS, Docker, PostgreSQL, Python, Bash, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Networking, Ansible, Debugging

Posted 15 days ago

📍 United States, Canada

🧭 Full-Time

💸 147,000 - 289,000 USD per year

🔍 Software Development

🏢 Company: MongoDB | 👥 1001-5000 | 💰 Post-IPO Equity about 7 years ago | Database, Open Source, Cloud Computing, SaaS, Software

  • 10+ years of experience working on software and operating distributed systems, with deep expertise in networking fundamentals and a good understanding of how the internet works, e.g. TCP/IP (including IPv6), DNS, TLS/mTLS, BGP, tunnels, overlays, and SDN principles
  • 2+ years of experience managing engineering teams, fostering a positive team culture and handling career growth and performance conversations
  • Intimately familiar with modern cloud-based infrastructure and the network design primitives of at least one of AWS, Azure, or GCP, e.g. VPCs, subnetting, routing, VPNs, peering, private link / private service connect, and CDNs
  • Strong knowledge of service mesh and load-balancing concepts, and eagerness to implement these in a multi-cloud environment
  • Lead a team of engineers, setting direction, removing blockers, and ensuring alignment with organizational goals.
  • Oversee the development of a reliable and resilient multi-cloud globally-connected network that is crucial for MongoDB’s services
  • Collaborate with service-owning teams to provide internal support, addressing technical issues and offering guidance on best practices for service-to-service connectivity
  • Participate in a 24/7 on-call rotation to swiftly resolve issues related to network architecture and service-to-service connectivity, ensuring minimal disruption and high availability

🪄 Skills: AWS, Leadership, GCP, Kubernetes, People Management, Azure, CI/CD, Linux, Terraform, Networking, Excellent communication skills, JSON

Posted 15 days ago

📍 United States, Canada

🧭 Full-Time

💸 180,000 - 185,000 USD per year

🔍 Financial Services

🏢 Company: Reach Financial | 👥 51-100 | Financial Services, Banking, Payments

  • 5+ years of experience in Software Engineering or Site Reliability Engineering, focusing on scalability, automation, and reliability.
  • Strong coding skills in at least one language (Python, JavaScript/TypeScript, Go, or similar), with experience writing production-quality software.
  • Proficiency with CI/CD tools (GitHub Actions, ArgoCD, Jenkins, or similar), integrating security and testing into deployment pipelines.
  • Experience with containerization (Docker, OCI) and orchestration tools such as Kubernetes or AWS ECS.
  • Deep understanding of observability and monitoring using OpenTelemetry, Datadog, Prometheus, Grafana, or similar.
  • Hands-on experience with serverless architectures (AWS Lambda, Step Functions) and event-driven systems (SNS, SQS, Kafka).
  • Familiarity with Infrastructure as Code (IaC) (Terraform, OpenTofu, CloudFormation) and cloud-native architecture.
  • A collaborative mindset and excellent communication skills, working closely with Software Engineers, PDEs, and Platform teams to drive system reliability and performance.
  • Develop software that improves system reliability, scalability, and performance.
  • Collaborate with Software Engineers, SDETs, and Platform Teams to design highly available, fault-tolerant systems.
  • Build self-healing and auto-scaling systems to minimize manual intervention and eliminate toil.
  • Enhance observability by developing custom monitoring, logging, tracing, and alerting solutions.
  • Write production-quality code in Python, JavaScript/TypeScript, or other modern languages.
  • Design and enforce SLIs, SLOs, and error budgets in partnership with Product and Engineering teams.
  • Troubleshoot and resolve complex incidents, applying root cause analysis (RCA) and postmortem processes.
  • Optimize cloud infrastructure (AWS, Kubernetes, Lambda, EC2) for cost, performance, and availability.
  • Partner with Security teams to ensure compliance and best practices are integrated across all systems and processes.

🪄 Skills: AWS, Docker, Python, SQL, Cloud Computing, JavaScript, Kafka, Kubernetes, MySQL, TypeScript, Go, Grafana, Prometheus, REST API, Serverless, Communication Skills, CI/CD, Problem Solving, Linux, DevOps, Terraform, Microservices, Excellent communication skills, JSON, Ansible, Financial analysis, Software Engineering

Posted 16 days ago

📍 United States

🧭 Full-Time

💸 120,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Bitwarden | 👥 101-250 | 💰 $100,000,000 Series B over 2 years ago | Privacy, Cyber Security, Enterprise Software, Identity Management, Software

  • Expertise with multi-region deployments in public cloud environments
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.)
  • Strong background in Reliability Engineering, DevOps, Software Engineering
  • Fluency in at least one programming language, such as C#, Python, or Go
  • Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi)
  • Proficiency using source control such as Git
  • Ability to maintain discretion, handle sensitive information, and improve security best practices
  • Technocrat at heart, staying current with trends and new technologies
  • Collaborative and adaptable mindset
  • Openness and authenticity combined with excellent communication skills
  • Excitement and enthusiasm for open source and for better internet security
  • Excellent problem-solving skills – you might not know all the answers, but you know how to find and communicate the possible solutions
  • Take ownership of the Bitwarden cloud infrastructure, with an emphasis on quality that translates directly to user delight
  • Evaluate current infrastructure and, on a regular basis, make recommendations for reliability, security, availability, scalability and cost management
  • Implement site reliability tools, monitoring, early warning and alert systems, and observability across Bitwarden cloud environments
  • Respond to infrastructure-based outages; participate in and contribute to the ongoing strategy for 24x7 support (there is an on-call rotation with a weekend shift every 5-6 weeks)
  • Architectural designs and engineering operations at scale
  • Participate actively in code reviews, learning and spreading technical knowledge
  • Contribute to and mature incident management/escalation processes
  • Collaborate with cross-functional teams to refine priorities and deliverables
  • Engage with product owners on an ongoing basis to align SLIs/SLOs/SLAs
  • Evaluate and identify opportunities for new initiatives to support organizational needs
  • Evolve and influence Bitwarden’s SDLC as we scale
  • Provide mentorship to teammates

🪄 Skills: AWS, Docker, Python, Cloud Computing, Git, Kubernetes, Go, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices

Posted 16 days ago