Senior Site Reliability Engineer

Posted 15 days ago

💎 Seniority level: Senior

📍 Location: United States

💸 Salary: 120,000 - 185,000 USD per year

🔍 Industry: Software Development

🏢 Company: Bitwarden 👥 101-250 💰 $100,000,000 Series B over 2 years ago · Privacy, Cyber Security, Enterprise Software, Identity Management, Software

🗣️ Languages: English

🪄 Skills: AWS, Docker, Python, Cloud Computing, Git, Kubernetes, Go, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices

Requirements:

Expertise with multi-region deployments in public cloud environments
Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.)
Strong background in Reliability Engineering, DevOps, Software Engineering
Fluency with at least one programming language, such as C#, Python, or Go
Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi)
Proficiency using source control such as Git
Ability to maintain discretion, handle sensitive information, and improve security best practices
Technocrat at heart, staying current with trends and new technologies
Collaborative and adaptable mindset
Openness and authenticity combined with excellent communication skills
Excitement and enthusiasm for open source and for better internet security
Excellent problem-solving skills – you might not know all the answers, but you know how to find and communicate the possible solutions

Responsibilities:

Take ownership of the Bitwarden cloud infrastructure, with an emphasis on quality that translates directly to user delight
Evaluate current infrastructure and, on a regular basis, make recommendations for reliability, security, availability, scalability and cost management
Implement site reliability tools, monitoring, early warning and alert systems, and observability across Bitwarden cloud environments
Respond to infrastructure-based outages; participate in and contribute to the ongoing strategy for 24x7 support (there is an on-call rotation with a weekend shift every 5-6 weeks)
Architectural designs and engineering operations at scale
Active participation in code reviews, learning and spreading technical knowledge
Contribute to and mature incident management/escalation processes
Collaborate with cross-functional teams to refine priorities and deliverables
Ongoing engagement with product owners to align SLIs/SLOs/SLAs
Evaluate and identify opportunities for new initiatives to support organizational needs
Evolve and influence Bitwarden’s SDLC as we scale
Provide mentorship to teammates

Related Jobs

🔥 Senior Site Reliability Engineer - AWS

Posted about 20 hours ago

📍 United States

💸 135,000 - 170,000 USD per year

🔍 Fintech

🏢 Company: Kunai 👥 51-100 · Consulting, Financial Services, Information Technology, FinTech, Software

🔧 Requirements

6+ years in SRE, DevOps, or infrastructure roles, ideally supporting distributed systems or microservices
3+ years in AWS or other public cloud providers
Hands-on experience with SRE tooling: observability, alerting, incident response (e.g., New Relic, PagerDuty, Splunk, OpenTelemetry)
Proficiency in at least one programming language such as Go, Java, or Python
Experience with automation testing or performance testing tools like Selenium, Postman, Cucumber, JMeter, or similar
Strong understanding of CI/CD principles and experience with build and deployment automation

💡 Responsibilities

Build and maintain tools that improve the reliability, performance, and availability of platform services
Implement observability practices and help integrate SRE tooling such as New Relic, OpenTelemetry, and Splunk
Collaborate with development and infrastructure teams to resolve incidents, improve monitoring, and optimize system performance
Help automate and streamline deployment, recovery, and testing processes
Contribute to shared engineering patterns and support adoption across multiple domain teams

AWS, Python, Java, JMeter, Kubernetes, Go, Selenium, CI/CD, Linux, DevOps, Microservices

🔥 Senior Site Reliability Engineer, Core AI Infrastructure

Posted 9 days ago

📍 USA

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase Careers Page 👥 1000-5000

🔧 Requirements

Proven experience as a Site Reliability Engineer (SRE) or similar role.
Strong understanding of AI technologies and platforms.
Experience with deploying and managing applications in a cloud environment (AWS/GCP).
Solid backend development experience with programming languages such as Python, Java, or Go.
Strong proficiency in managing and configuring public cloud services (AWS/GCP) for scalability and reliability.
Experience with automation tools and scripting (e.g., Ansible, Terraform, Bash, Python).
Excellent troubleshooting and problem-solving skills.
Strong communication and collaboration skills.
Strong security and compliance understanding.
Experience working in a highly regulated environment
Experience in a fast-paced, high-growth company

💡 Responsibilities

Deploy, configure, and manage AI-powered employee productivity tools and AI solutions built in-house
Ensure high availability, reliability, and optimal performance of AI platforms and services. Implement monitoring, alerting, and incident response procedures
Design and implement scalable infrastructure to support the growing demands of AI tools and user base. Optimize resource utilization and manage capacity planning
Develop and maintain automation scripts and tools to streamline deployment, monitoring, and maintenance tasks. Contribute to the experimental sandbox environments for testing new AI solutions
Collaborate with cross-functional teams (Machine-Learning, HR, Security, Data Science, Developer Experience) to support the development and integration of AI solutions. Provide technical support and troubleshooting for AI-related issues
Adhere to security and privacy policies while deploying and managing AI tools. Ensure compliance with regulatory requirements
Implement comprehensive monitoring and metrics to track the performance and health of AI systems. Analyze data to identify areas for improvement and optimization
Participate in incident response and troubleshooting for AI-related outages or performance issues. Develop and maintain incident response plans
Contribute to backend development tasks to support the integration and functionality of AI tools
Deploy and manage AI solutions on public cloud platforms (AWS/GCP), leveraging cloud-native services and best practices
Excellent communication skills and experience presenting technical information to non-technical audiences, including senior leadership

AWS, Backend Development, Docker, Python, Bash, Cloud Computing, GCP, Java, Kubernetes, Go, CI/CD, RESTful APIs, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer

Posted 14 days ago

📍 United States

🧭 Full-Time

💸 161,000 - 180,000 USD per year

🔍 Adult entertainment

🏢 Company: Multi Media LLC

🔧 Requirements

STEM degree and/or relevant experience as a Site Reliability Engineer, DevOps Engineer, or SWE
Proficiency in Python or Golang. Experience with another compiled or high-level language (C, C++, Java, Rust, etc.) is also accepted
Experience running Web applications at scale
Experience with Web application concepts and frameworks: ORM, MVC architecture, Django, Flask, Laravel, etc
Proficiency with Linux administration, Bash shell, and strong knowledge of Linux internals (e.g., filesystems, system calls)
Strong networking knowledge (e.g., routing, switching, TCP stack) for both metal and cloud (VPC, Security Groups) environments
Experience in database administration and configuration
Experience with DevOps tools such as Terraform, Ansible, Docker, and Kubernetes
Willingness to participate in on-call rotation and respond to monitoring and alerting of core website functions as needed

💡 Responsibilities

Analyze system performance using APM and distributed telemetry data to identify sources of instability
Improve scalability, reliability, and performance through software enhancements and patching
Develop tools and automation to streamline the DevOps pipeline
Design and manage infrastructure in both data center metal environments and in the public cloud
Conduct predictive failure analysis and disaster planning
Administer and configure databases and key-value stores with a focus on uptime and performance
Analyze complex systems to identify operational surprises and minimize downtime
Participate in incident response and produce postmortem reports
Collaborate with other engineering teams

AWS, Docker, PostgreSQL, Python, Bash, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Networking, Ansible, Debugging

🔥 Senior Site Reliability Engineer

Posted 15 days ago

📍 United States, Canada

🧭 Full-Time

💸 180,000 - 185,000 USD per year

🔍 Financial Services

🏢 Company: Reach Financial 👥 51-100 · Financial Services, Banking, Payments

🔧 Requirements

5+ years of experience in Software Engineering or Site Reliability Engineering, focusing on scalability, automation, and reliability.
Strong coding skills in at least one language (Python, JavaScript/TypeScript, Go, or similar), with experience writing production-quality software.
Proficiency with CI/CD tools (GitHub Actions, ArgoCD, Jenkins, or similar), integrating security and testing into deployment pipelines.
Experience with containerization (Docker, OCI) and orchestration tools such as Kubernetes or AWS ECS.
Deep understanding of observability and monitoring using OpenTelemetry, Datadog, Prometheus, Grafana, or similar.
Hands-on experience with serverless architectures (AWS Lambda, Step Functions) and event-driven systems (SNS, SQS, Kafka).
Familiarity with Infrastructure as Code (IaC) (Terraform, OpenTofu, CloudFormation) and cloud-native architecture.
A collaborative mindset and excellent communication skills, working closely with Software Engineers, PDEs, and Platform teams to drive system reliability and performance.

💡 Responsibilities

Develop software that improves system reliability, scalability, and performance.
Collaborate with Software Engineers, SDETs, and Platform Teams to design highly available, fault-tolerant systems.
Build self-healing and auto-scaling systems to minimize manual intervention and eliminate toil.
Enhance observability by developing custom monitoring, logging, tracing, and alerting solutions.
Write production-quality code in Python, JavaScript/TypeScript, or other modern languages.
Design and enforce SLIs, SLOs, and error budgets in partnership with Product and Engineering teams.
Troubleshoot and resolve complex incidents, applying root cause analysis (RCA) and postmortem processes.
Optimize cloud infrastructure (AWS, Kubernetes, Lambda, EC2) for cost, performance, and availability.
Partner with Security teams to ensure compliance and best practices are integrated across all systems and processes.

AWS, Docker, Python, SQL, Cloud Computing, JavaScript, Kafka, Kubernetes, MySQL, TypeScript, Go, Grafana, Prometheus, REST API, Serverless, Communication Skills, CI/CD, Problem Solving, Linux, DevOps, Terraform, Microservices, Excellent communication skills, JSON, Ansible, Financial analysis, Software Engineering

🔥 Senior Site Reliability Engineer

Posted 21 days ago

📍 United States, Canada

🧭 Full-Time

💸 170,000 - 210,000 USD per year

🔍 Software Development

🔧 Requirements

5+ years running production workloads on AWS (or GCP/Azure) with infrastructure-as-code (Terraform/CDK/CloudFormation)
Hands-on experience operating container orchestration (ECS, EKS, Kubernetes, Nomad, etc.) and designing blue/green or canary rollouts
Depth in at least two of our core datastores (Postgres, MongoDB, Kafka) including backup/restore, upgrades, and performance tuning
Fluency with CI/CD pipelines (we use Buildkite + GitHub Actions) and a knack for automating everything with shell, Python, or TypeScript
Proven track record setting up monitoring/alerting in Datadog, Prometheus, or similar, with clear SLO/SLA ownership
Strong grasp of Linux networking, load balancing (Cloudflare/ELB), and CDN/edge-security concepts
Excellent incident-management and root-cause analysis skills; able to write crisp RCAs and follow through on action items
Passion for customer-centric thinking, rapid iteration, and continuous learning

💡 Responsibilities

Set SLOs/SLIs, build self-healing architectures, and drive incident-prevention projects that keep our APIs and real-time ordering flows <100 ms p95.
Level up dashboards, alerts, and distributed tracing so teams can detect issues before customers do.
Evolve our Buildkite pipelines and Terraform modules to give engineers <10-minute, one-click rollouts (and clean rollbacks).
Harden infra with least-privilege IAM, threat-model topology changes, and guide SOC 2 / PCI efforts.
Tune Postgres for multi-TB workloads, maintain Mongo sharding, and shepherd Kafka topic management as event volume climbs.
Rotate with the on-call SREs, run blameless post-mortems, and convert findings into durable fixes.
Pair with product engineers on capacity reviews, guide junior devs on Docker best practices, and evangelize “you build it, you run it.”

AWS, Docker, Node.js, PostgreSQL, Python, Bash, Kafka, MongoDB, React Native, TypeScript, Vue.js, Nest.js, React, CI/CD, Linux, DevOps, Terraform, JSON

🔥 Senior Site Reliability Engineer (Remote - US)

Posted about 1 month ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Jobgether 👥 11-50 💰 $1,493,585 Seed about 2 years ago · Internet

🔧 Requirements

Minimum of 5 years of experience in SRE, DevOps, or Infrastructure Engineering, demonstrating strong ownership and problem-solving skills.
Proficiency in Kubernetes, Helm, and networking security practices.
In-depth experience with AWS services such as RDS, Aurora, VPC, EKS, EC2, and IAM.
Expertise in PostgreSQL administration, including performance tuning and high availability management within AWS.
Familiarity with CI/CD tools like GitHub Actions and ArgoCD, with a focus on automation and security best practices.
Strong understanding and experience in Infrastructure as Code (IaC) using Crossplane and Terraform.
Experience in observability and monitoring with Datadog.
Proficiency in Python and Bash scripting for system automation and management.
Strong communication skills and the ability to collaborate effectively across engineering teams and document processes in Confluence.

💡 Responsibilities

Own initiatives related to system reliability and scalability, identifying potential issues and implementing proactive solutions to prevent them.
Participate in on-call rotations, responding to incidents, performing root cause analysis, and driving long-term fixes.
Design, deploy, and manage Kubernetes clusters, utilizing tools like Helm charts, Cilium, and Karpenter to optimize both performance and cost.
Architect and maintain AWS infrastructure, focusing on RDS/Aurora PostgreSQL, networking, and scaling best practices.
Automate infrastructure provisioning using tools like Crossplane and Terraform to maintain consistency and scalability.
Enhance observability by improving monitoring systems using Datadog and drive proactive detection and resolution of system issues.
Conduct post-incident reviews and document lessons learned, driving improvements into long-term system practices.

AWS, PostgreSQL, Python, Amazon RDS, AWS EKS, Bash, Kubernetes, CI/CD, Terraform, Networking, Scripting, Confluence

🔥 Senior Site Reliability Engineer, Development Infrastructure

Posted about 1 month ago

📍 North America

🧭 Full-Time

💸 118,000 - 231,000 USD per year

🔍 Software Development

🏢 Company: MongoDB 👥 1001-5000 💰 Post-IPO Equity about 7 years ago · Database, Open Source, Cloud Computing, SaaS, Software

🔧 Requirements

6+ years of experience building and operating developer-facing infrastructure or platform tooling
Comfortable working with Terraform, and have experience managing Terraform workflows in a team or multi-environment setup
Hands-on experience with GitHub Actions, including writing reusable and maintainable workflows
Experience with Bazel (or similar build systems like Buck, Buck2 or Pants) and care about build performance, caching, and reproducibility
Experience with cloud infrastructure (AWS preferred) and familiarity with concepts like IAM, VPCs, OpenID Connect and CI/CD

💡 Responsibilities

Take ownership of core components of our Terraform and Bazel workflows
Propose and implement improvements to our infrastructure CI/CD and build pipelines
Contribute to a roadmap that makes it easier for teams to safely onboard infrastructure changes and scale workflows
Participate in our on-call rotation, focused on infrastructure tooling reliability supporting production services

AWS, Cloud Computing, Kubernetes, CI/CD, Linux, DevOps, Terraform

🔥 Senior Site Reliability Engineer - Data (REMOTE)

Posted about 1 month ago

📍 OR, WA, CA, CO, TX, IL

🧭 Full-Time

💸 130,000 - 140,000 USD per year

🔍 Software Development

🏢 Company: Discogs 👥 51-100 💰 $2,500,000 over 7 years ago · Database, Communities, Music

🔧 Requirements

5+ years of experience working with Kafka and relational database management systems (RDBMS)
Relational database schema design, query performance optimization, administration (MySQL, Percona Server, AWS RDS)
Kafka: Cluster administration (Strimzi), Kafka Connect (Debezium, JDBC)
CI/CD (GitHub Actions)
GitOps (ArgoCD)
Kubernetes (EKS, Kustomize, Karpenter, administration, application manifests)
AWS and cloud development (VPC, EKS, RDS, S3)
Observability (Datadog, Sentry)
Scripting (Shell, Python)

💡 Responsibilities

Stewarding Discogs’ data stores as a key subject matter expert
Leading efforts on the reliability and design patterns of our Kafka and Kafka Connect implementations
Establishing data contracts and clear communication standards between CDC producers and consumers
Working closely with engineering squads to refactor and re-architect MySQL database schema and indexing for long-term scalability, performance, and cost effectiveness
Mentoring engineering squads on Platform best practices for MySQL, Kafka, and other software development lifecycle areas
Writing documentation and runbooks that contribute to the engineering organization’s knowledge base
Working in a containerized, orchestrated environment
Contributing to the Platform team’s disciplines of site reliability and operations, supporting both our squads and Platform’s central infrastructure
Participating in on-call rotation, responding to incidents, and troubleshooting data and other operations issues

AWS, Python, ElasticSearch, Git, Kafka, Kubernetes, MySQL, FastAPI, RDBMS, REST API, CI/CD, Mentoring, Terraform, Documentation, Troubleshooting, Scripting

🔥 Senior Site Reliability Engineer

Posted about 2 months ago

📍 United States

🧭 Full-Time

💸 120,000 - 125,000 USD per year

🔍 Software Development

🔧 Requirements

Extensive expertise building and deploying web apps in AWS, Azure, and GCP
Networking
Distributed systems
Public cloud and container security (RBAC, process isolation, network security, firewalls, certificate management, etc.)
Reliability engineering (disaster resilience, multi-zonal deployments, logging practices, SLOs/SLIs, monitoring, deployment strategy, etc.)
Kubernetes
Docker/containers
Terraform
Python
Version control systems (we use Git/GitHub)
Linux
DevOps concepts and best practices
Authentication technologies such as OIDC, SAML

💡 Responsibilities

Participate in the development of CiviForm products as a service, building upon our existing deployment system and building out a new Kubernetes-based prototype to ensure robust, secure, and scalable production instances.
Manage staging and production environments, being on call to address outages
Work with governments on issues related to the service
Own and evolve the deployment systems
Participate in the development of a new CiviForm SaaS (Software as a Service).
Own development of this deployment system utilizing Kubernetes from prototyping through to delivery.
CiviForm’s existing infrastructure is defined with Python and Terraform and deployed into AWS and Azure. Improve the flexibility and features of the system to meet the needs of governments deploying CiviForm to their own cloud providers.
Define, implement, gather, and analyze metrics from deployments to identify areas for improvement related to cloud configuration
Partner with the engineering team to improve services through rigorous testing and release procedures, as well as resolving scaling issues and improving resilience
Draft Service Level Objectives, define Service Level Indicators, and implement both
Develop playbooks for deployments, including a strategy for monitoring and alerting and guidance on how to address issues
Identify and mitigate security risks in deployments
Contribute to CI/CD implementation and best practices

AWS, Docker, Python, GCP, Kubernetes, Azure, CI/CD, Linux, DevOps, Terraform, Microservices, Networking

🔥 Senior Site Reliability Engineer, EU, UK or Americas

Posted 2 months ago

📍 UK, Americas, EU

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Auros 👥 11-50 💰 $17,000,000 about 2 years ago · Cryptocurrency

🔧 Requirements

An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
Knowledge and experience in managing configuration at scale.
Experience with CI/CD pipelines and version control best practices.
Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
Strong knowledge of cloud security and IAM policies is required.
SIEM and threat management experience.
Understanding of network security and application of zero-trust principles.
Must know how to secure Mac and Linux endpoints.
Software supply chain security knowledge and experience are desirable, along with an understanding of vulnerability and secret scanning within CI/CD pipelines.

💡 Responsibilities

Participate in the on-call roster to support our trading operations.
Maintain and improve our global infrastructure with high performance and reliability requirements.
Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
Development of internal tools and automation to accomplish the team’s goals.
Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
Active participation in various trading and infrastructure projects.
Work closely with developers, traders and other staff to accomplish our firm’s goals.

AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible
