Site Reliability Engineer

Posted 1 day ago

📍 Location: United States

💸 Salary: 90,000 - 145,000 USD per year

🏢 Company: Bitwarden | 👥 101-250 | 💰 $100,000,000 Series B over 2 years ago | Privacy, Cyber Security, Enterprise Software, Identity Management, Software

🗣️ Languages: English

🪄 Skills: Python, Cloud Computing, Git, Kubernetes, C#, Go, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible

Requirements:

Experience with multi-region deployments in public cloud environments
Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.)
Fluency with at least one programming language, such as C#, Python, Go, etc.
Working knowledge of cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi)
Proficiency using source control such as Git
Ability to maintain discretion, handle sensitive information, and improve security best practices
Technocrat at heart, staying current with trends and new technologies
Collaborative and adaptable mindset
Openness and authenticity combined with excellent communication skills
Excitement and enthusiasm for open source and for better internet security
Excellent problem-solving skills – you might not know all the answers, but you know how to find and communicate the possible solutions

Responsibilities:

Take ownership of the Bitwarden cloud infrastructure, with an emphasis on quality that translates directly to user delight
Evaluate current infrastructure and, on a regular basis, make recommendations for reliability, security, availability, scalability and cost management
Implement site reliability tools, monitoring, early warning and alert systems, and observability across Bitwarden cloud environments
Respond to infrastructure-based outages; participate in and contribute to the ongoing strategy for 24x7 support (there is an on-call rotation with a weekend shift every 5-6 weeks)
Actively participate in code reviews, learning and spreading technical knowledge
Contribute to and mature incident management/escalation processes
Collaborate with cross-functional teams to refine priorities and deliverables
Engage regularly with product owners to align SLIs/SLOs/SLAs
Evaluate and identify opportunities for new initiatives to support organizational needs

Related Jobs

🔥 Senior Site Reliability Engineer - AWS

Posted 1 day ago

📍 United States

💸 135,000 - 170,000 USD per year

🔍 Fintech

🏢 Company: Kunai | 👥 51-100 | Consulting, Financial Services, Information Technology, FinTech, Software

🔧 Requirements

6+ years in SRE, DevOps, or infrastructure roles, ideally supporting distributed systems or microservices
3+ years in AWS or other public cloud providers
Hands-on experience with SRE tooling: observability, alerting, incident response (e.g., New Relic, PagerDuty, Splunk, OpenTelemetry)
Proficiency in at least one programming language such as Go, Java, or Python
Experience with automation testing or performance testing tools like Selenium, Postman, Cucumber, JMeter, or similar
Strong understanding of CI/CD principles and experience with build and deployment automation

💡 Responsibilities

Build and maintain tools that improve the reliability, performance, and availability of platform services
Implement observability practices and help integrate SRE tooling such as New Relic, OpenTelemetry, and Splunk
Collaborate with development and infrastructure teams to resolve incidents, improve monitoring, and optimize system performance
Help automate and streamline deployment, recovery, and testing processes
Contribute to shared engineering patterns and support adoption across multiple domain teams

🪄 Skills: AWS, Python, Java, JMeter, Kubernetes, Go, Selenium, CI/CD, Linux, DevOps, Microservices

🔥 Principal Site Reliability Engineer

Posted 2 days ago

📍 United States

🧭 Full-Time

💸 240,000 - 400,000 USD per year

🔍 Software Development

🏢 Company: Cribl | 👥 251-500 | 💰 $150,000,000 Series D about 3 years ago | Real Time, Big Data, Information Technology, Software

🔧 Requirements

Extensive experience with enterprise scale continuous delivery environments
10+ years of experience in a DevOps or SRE role
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with IaC tools like Terraform (preferred) or similar
Experience with sustainable incident response in a blameless environment
Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
Experience with APM, observability, and related tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
Deep understanding of SRE practices, such as SLOs, Error Budgets, PRRs, Problem Management
Comfortable with a high level of autonomy and working with a distributed team

💡 Responsibilities

Chart the future of Cribl’s observability and reliability systems and practices
Conceptualize and direct the evolution of our reliability metrics, programs and process based on the state of the art and industry best practices
Engage with Product and Engineering teams to improve service delivery and reliability across the entire software lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Uncover risks and seek out the sources of errors and instability in our production systems
Advocate for engineering-wide improvements in reliability and observability, and promote antifragility
Identify and drive down toil with creative innovation and automation
Participate in on-call

🪄 Skills: AWS, Docker, Node.js, SQL, Cloud Computing, Git, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, CI/CD, Agile methodologies, RESTful APIs, Linux, DevOps, Terraform, Microservices, JSON, Scripting

🔥 Site Reliability Engineer

Posted 5 days ago

📍 France, Germany, Spain, United Kingdom, United States, Canada

🧭 Full-Time

🔍 Software Development

🔧 Requirements

DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

🪄 Skills: AWS, Docker, Python, Bash, Cloud Computing, Elasticsearch, GCP, Kubernetes, Algorithms, Azure, Data Structures, Go, Grafana, Prometheus, Collaboration, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Troubleshooting, Ansible, Scripting

🔥 Principal Site Reliability Engineer - Remote

Posted 7 days ago

📍 United States of America

💸 150,000 - 160,000 USD per year

🏢 Company: external-northamerica

🔧 Requirements

Bachelor’s degree in Computer Science, Engineering, or related field
A minimum of 10 years of experience, including at least 5 years in the SRE field, with a proven track record of progressively increasing responsibilities
Demonstrated ability to work with cross-functional Development, QE and Operations teams
Strong understanding and experience in automation tools and programming/scripting languages (e.g., PowerShell, Python, Bash) to deliver improvements at a small and large scale.
Strong understanding of observability tools (e.g., Dynatrace, Datadog, New Relic) and best practices, to implement effective monitoring of SLIs/SLOs/SLAs.
Strong experience and understanding of software engineering, Infrastructure as Code (Ansible or Terraform) and build/deployment pipelines.
Strong troubleshooting skills coupled with making data-driven decisions during incidents, to improve time to detect and resolve issues.
Strong understanding of cloud computing platforms (Azure or Google Cloud) and cloud-native setups (AKS, serverless, etc.).

💡 Responsibilities

Contribute significantly to the reliability, scalability and availability of Bright Horizons' digital infrastructure by enforcing best practices of redundancy and resiliency across applications and infrastructure.
Implement robust infrastructure, application and digital-experience monitoring in our enterprise-wide APM tool Dynatrace. Proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.
Drive troubleshooting of critical incidents through developing a deep and broad understanding of our enterprise architecture across all 7 OSI layers.
Drive the development and implementation of automation solutions to streamline processes, reduce manual interventions, and enhance the overall efficiency of the Product, Engineering and SRE teams.
In addition to owning the Observability tools, create a roadmap to expand and consolidate them. The result should provide a 360-degree view of cross-functional areas like SRE, DevOps, Application Support, Monitoring, Incident Management, Infrastructure and Enterprise Architecture.
Collaborate with the above cross-functional teams to drive a unified approach to site reliability that optimizes their work and improves time-to-market for all respective objectives.
Work closely with Infrastructure and Architecture teams to design and implement roadmaps for scaling server and serverless architecture using containers as well as IaC tools like Ansible, Terraform, etc.

🪄 Skills: AWS, Python, Bash, Cloud Computing, Kubernetes, Azure, CI/CD, DevOps, Terraform, Ansible, Scripting, Software Engineering

🔥 Site Reliability Engineer (USA Only - 100% Remote)

Posted 8 days ago

📍 USA

🧭 Full-Time

🔍 Software Development

🏢 Company: Close

🔧 Requirements

5+ years of experience building modern infrastructure systems.
Familiarity with AWS, Terraform, Kubernetes, Ansible, MongoDB, PostgreSQL, Elasticsearch
Strong grasp of common networking and data transfer protocols such as DNS, HTTP, TCP

💡 Responsibilities

Fully automating our databases' lifecycles with Argo Workflows
Eliminating all static credentials, wherever they may be
Reducing downtime and disruption due to maintenance or disaster to new lows
Helping us improve our multi-region disaster recovery system

🪄 Skills: AWS, PostgreSQL, Bash, Elasticsearch, Git, Kubernetes, MongoDB, ClickHouse, Grafana, CI/CD, Linux, DevOps, Terraform, Networking, Ansible, Scripting

🔥 Senior Site Reliability Engineer, Core AI Infrastructure

Posted 10 days ago

📍 USA

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase Careers Page | 👥 1000-5000

🔧 Requirements

Proven experience as a Site Reliability Engineer (SRE) or similar role.
Strong understanding of AI technologies and platforms.
Experience with deploying and managing applications in a cloud environment (AWS/GCP).
Solid backend development experience with programming languages such as Python, Java, or Go.
Strong proficiency in managing and configuring public cloud services (AWS/GCP) for scalability and reliability.
Experience with automation tools and scripting (e.g., Ansible, Terraform, Bash, Python).
Excellent troubleshooting and problem-solving skills.
Strong communication and collaboration skills.
Strong security and compliance understanding.
Experience working in a highly regulated environment
Experience in a fast-paced, high-growth company

💡 Responsibilities

Deploy, configure, and manage AI-powered employee productivity tools and AI solutions built in-house
Ensure high availability, reliability, and optimal performance of AI platforms and services. Implement monitoring, alerting, and incident response procedures
Design and implement scalable infrastructure to support the growing demands of AI tools and user base. Optimize resource utilization and manage capacity planning
Develop and maintain automation scripts and tools to streamline deployment, monitoring, and maintenance tasks. Contribute to the experimental sandbox environments for testing new AI solutions
Collaborate with cross-functional teams (Machine-Learning, HR, Security, Data Science, Developer Experience) to support the development and integration of AI solutions. Provide technical support and troubleshooting for AI-related issues
Adhere to security and privacy policies while deploying and managing AI tools. Ensure compliance with regulatory requirements
Implement comprehensive monitoring and metrics to track the performance and health of AI systems. Analyze data to identify areas for improvement and optimization
Participate in incident response and troubleshooting for AI-related outages or performance issues. Develop and maintain incident response plans
Contribute to backend development tasks to support the integration and functionality of AI tools
Deploy and manage AI solutions on public cloud platforms (AWS/GCP), leveraging cloud-native services and best practices
Communicate clearly and present technical information to non-technical audiences, including senior leadership

🪄 Skills: AWS, Backend Development, Docker, Python, Bash, Cloud Computing, GCP, Java, Kubernetes, Go, CI/CD, RESTful APIs, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer

Posted 15 days ago

📍 United States

🧭 Full-Time

💸 161,000 - 180,000 USD per year

🔍 Adult entertainment

🏢 Company: Multi Media LLC

🔧 Requirements

STEM degree and/or relevant experience as a Site Reliability Engineer, DevOps Engineer, or SWE
Proficiency in Python or Golang. Experience with another compiled or high-level language (C, C++, Java, Rust, etc.) is also accepted
Experience running Web applications at scale
Experience with Web application concepts and frameworks: ORM, MVC architecture, Django, Flask, Laravel, etc
Proficiency with Linux administration, Bash shell, and strong knowledge of Linux internals (e.g., filesystems, system calls)
Strong networking knowledge (e.g., routing, switching, TCP stack) for both metal and cloud (VPC, Security Groups) environments
Experience in database administration and configuration
Experience with DevOps tools such as Terraform, Ansible, Docker, and Kubernetes
Willingness to participate in on-call rotation and respond to monitoring and alerting of core website functions as needed

💡 Responsibilities

Analyze system performance using APM and distributed telemetry data to identify sources of instability
Improve scalability, reliability, and performance through software enhancements and patching
Develop tools and automation to streamline the DevOps pipeline
Design and manage infrastructure in both data center metal environments and in the public cloud
Conduct predictive failure analysis and disaster planning
Administer and configure databases and key-value stores with a focus on uptime and performance
Analyze complex systems to identify operational surprises and minimize downtime
Participate in incident response and produce postmortem reports
Collaborate with other engineering teams

🪄 Skills: AWS, Docker, PostgreSQL, Python, Bash, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Networking, Ansible, Debugging

🔥 Lead, Site Reliability Engineer, Fabric

Posted 15 days ago

📍 United States, Canada

🧭 Full-Time

💸 147,000 - 289,000 USD per year

🔍 Software Development

🏢 Company: MongoDB | 👥 1001-5000 | 💰 Post-IPO Equity about 7 years ago | Database, Open Source, Cloud Computing, SaaS, Software

🔧 Requirements

10+ years of experience working on software and operating distributed systems, with deep expertise in networking fundamentals and a good understanding of how the internet works, e.g. TCP/IP (including IPv6), DNS, TLS/mTLS, BGP, tunnels, overlays, and SDN principles
2+ years of experience managing engineering teams, fostering a positive team culture and handling career growth and performance conversations
Intimately familiar with modern cloud-based infrastructure and the network design primitives of at least one of AWS, Azure, or GCP, e.g. VPCs, subnetting, routing, VPNs, peering, private link / private service connect, and CDNs
Strong knowledge of service mesh and load-balancing concepts, and eagerness to implement these in a multi-cloud environment

💡 Responsibilities

Lead a team of engineers, setting direction, removing blockers, and ensuring alignment with organizational goals.
Oversee the development of a reliable and resilient multi-cloud globally-connected network that is crucial for MongoDB’s services
Collaborate with service-owning teams to provide internal support, addressing technical issues and offering guidance on best practices for service-to-service connectivity
Participate in a 24/7 on-call rotation to swiftly resolve issues related to network architecture and service-to-service connectivity, ensuring minimal disruption and high availability

🪄 Skills: AWS, Leadership, GCP, Kubernetes, People Management, Azure, CI/CD, Linux, Terraform, Networking, Excellent communication skills, JSON

🔥 Senior Site Reliability Engineer

Posted 16 days ago

📍 United States, Canada

🧭 Full-Time

💸 180,000 - 185,000 USD per year

🔍 Financial Services

🏢 Company: Reach Financial | 👥 51-100 | Financial Services, Banking, Payments

🔧 Requirements

5+ years of experience in Software Engineering or Site Reliability Engineering, focusing on scalability, automation, and reliability.
Strong coding skills in at least one language (Python, JavaScript/TypeScript, Go, or similar), with experience writing production-quality software.
Proficiency with CI/CD tools (GitHub Actions, ArgoCD, Jenkins, or similar), integrating security and testing into deployment pipelines.
Experience with containerization (Docker, OCI) and orchestration tools such as Kubernetes or AWS ECS.
Deep understanding of observability and monitoring using OpenTelemetry, Datadog, Prometheus, Grafana, or similar.
Hands-on experience with serverless architectures (AWS Lambda, Step Functions) and event-driven systems (SNS, SQS, Kafka).
Familiarity with Infrastructure as Code (IaC) (Terraform, OpenTofu, CloudFormation) and cloud-native architecture.
A collaborative mindset and excellent communication skills, working closely with Software Engineers, PDEs, and Platform teams to drive system reliability and performance.

💡 Responsibilities

Develop software that improves system reliability, scalability, and performance.
Collaborate with Software Engineers, SDETs, and Platform Teams to design highly available, fault-tolerant systems.
Build self-healing and auto-scaling systems to minimize manual intervention and eliminate toil.
Enhance observability by developing custom monitoring, logging, tracing, and alerting solutions.
Write production-quality code in Python, JavaScript/TypeScript, or other modern languages.
Design and enforce SLIs, SLOs, and error budgets in partnership with Product and Engineering teams.
Troubleshoot and resolve complex incidents, applying root cause analysis (RCA) and postmortem processes.
Optimize cloud infrastructure (AWS, Kubernetes, Lambda, EC2) for cost, performance, and availability.
Partner with Security teams to ensure compliance and best practices are integrated across all systems and processes.

🪄 Skills: AWS, Docker, Python, SQL, Cloud Computing, JavaScript, Kafka, Kubernetes, MySQL, TypeScript, Go, Grafana, Prometheus, REST API, Serverless, Communication Skills, CI/CD, Problem Solving, Linux, DevOps, Terraform, Microservices, Excellent communication skills, JSON, Ansible, Financial analysis, Software Engineering

🔥 Senior Site Reliability Engineer

Posted 16 days ago

📍 United States

🧭 Full-Time

💸 120,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Bitwarden | 👥 101-250 | 💰 $100,000,000 Series B over 2 years ago | Privacy, Cyber Security, Enterprise Software, Identity Management, Software

🔧 Requirements

Expertise with multi-region deployments in public cloud environments
Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.)
Strong background in Reliability Engineering, DevOps, Software Engineering
Fluency with at least one programming language, such as C#, Python, Go, etc.
Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi)
Proficiency using source control such as Git
Ability to maintain discretion, handle sensitive information, and improve security best practices
Technocrat at heart, staying current with trends and new technologies
Collaborative and adaptable mindset
Openness and authenticity combined with excellent communication skills
Excitement and enthusiasm for open source and for better internet security
Excellent problem-solving skills – you might not know all the answers, but you know how to find and communicate the possible solutions

💡 Responsibilities

Take ownership of the Bitwarden cloud infrastructure, with an emphasis on quality that translates directly to user delight
Evaluate current infrastructure and, on a regular basis, make recommendations for reliability, security, availability, scalability and cost management
Implement site reliability tools, monitoring, early warning and alert systems, and observability across Bitwarden cloud environments
Respond to infrastructure-based outages; participate in and contribute to the ongoing strategy for 24x7 support (there is an on-call rotation with a weekend shift every 5-6 weeks)
Architectural designs and engineering operations at scale
Actively participate in code reviews, learning and spreading technical knowledge
Contribute to and mature incident management/escalation processes
Collaborate with cross-functional teams to refine priorities and deliverables
Engage regularly with product owners to align SLIs/SLOs/SLAs
Evaluate and identify opportunities for new initiatives to support organizational needs
Evolve and influence Bitwarden’s SDLC as we scale
Provide mentorship to teammates

🪄 Skills: AWS, Docker, Python, Cloud Computing, Git, Kubernetes, Go, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices
