Senior Site Reliability Engineer

Posted 15 days ago

💎 Seniority level: Senior

📍 Location: United States

💸 Salary: 120,000 - 185,000 USD per year

🔍 Industry: Software Development

🏢 Company: Bitwarden | 👥 101-250 | 💰 $100,000,000 Series B over 2 years ago | Privacy, Cyber Security, Enterprise Software, Identity Management, Software

🗣️ Languages: English

🪄 Skills: AWS, Docker, Python, Cloud Computing, Git, Kubernetes, Go, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices

Requirements:
  • Expertise with multi-region deployments in public cloud environments
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc)
  • Strong background in Reliability Engineering, DevOps, Software Engineering
  • Fluency with at least one programming language, such as C#, Python, or Go
  • Experience with cloud deployment and automation tools/methodologies (e.g., GitOps, Terraform, Pulumi)
  • Proficiency using source control such as Git
  • Ability to maintain discretion, handle sensitive information, and improve security best practices
  • Technocrat at heart, staying current with trends and new technologies
  • Collaborative and adaptable mindset
  • Openness and authenticity combined with excellent communication skills
  • Excitement and enthusiasm for open source and for better internet security
  • Excellent problem-solving skills – you might not know all the answers, but you know how to find and communicate the possible solutions
Responsibilities:
  • Take ownership of the Bitwarden cloud infrastructure, with an emphasis on quality that translates directly to user delight
  • Evaluate current infrastructure and, on a regular basis, make recommendations for reliability, security, availability, scalability and cost management
  • Implement site reliability tools, monitoring, early warning and alert systems, and observability across Bitwarden cloud environments
  • Respond to infrastructure-based outages; participate in and contribute to the ongoing strategy for 24x7 support (there is an on-call rotation with a weekend shift every 5-6 weeks)
  • Contribute to architectural designs and engineering operations at scale
  • Active participation in code reviews, learning and spreading technical knowledge
  • Contribute to and mature incident management/escalation processes
  • Collaborate with cross-functional teams to refine priorities and deliverables
  • Ongoing engagement with product owners to align SLIs/SLOs/SLAs
  • Evaluate and identify opportunities for new initiatives to support organizational needs
  • Evolve and influence Bitwarden’s SDLC as we scale
  • Provide mentorship to teammates

Related Jobs

📍 United States

💸 135,000 - 170,000 USD per year

🔍 Fintech

🏢 Company: Kunai | 👥 51-100 | Consulting, Financial Services, Information Technology, FinTech, Software

  • 6+ years in SRE, DevOps, or infrastructure roles, ideally supporting distributed systems or microservices
  • 3+ years in AWS or other public cloud providers
  • Hands-on experience with SRE tooling: observability, alerting, incident response (e.g., New Relic, PagerDuty, Splunk, OpenTelemetry)
  • Proficiency in at least one programming language such as Go, Java, or Python
  • Experience with automation testing or performance testing tools like Selenium, Postman, Cucumber, JMeter, or similar
  • Strong understanding of CI/CD principles and experience with build and deployment automation
  • Build and maintain tools that improve the reliability, performance, and availability of platform services
  • Implement observability practices and help integrate SRE tooling such as New Relic, OpenTelemetry, and Splunk
  • Collaborate with development and infrastructure teams to resolve incidents, improve monitoring, and optimize system performance
  • Help automate and streamline deployment, recovery, and testing processes
  • Contribute to shared engineering patterns and support adoption across multiple domain teams

AWS, Python, Java, JMeter, Kubernetes, Go, Selenium, CI/CD, Linux, DevOps, Microservices

Posted about 20 hours ago

📍 USA

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase Careers Page | 👥 1000-5000

  • Proven experience as a Site Reliability Engineer (SRE) or similar role.
  • Strong understanding of AI technologies and platforms.
  • Experience with deploying and managing applications in a cloud environment (AWS/GCP).
  • Solid backend development experience with programming languages such as Python, Java, or Go.
  • Strong proficiency in managing and configuring public cloud services (AWS/GCP) for scalability and reliability.
  • Experience with automation tools and scripting (e.g., Ansible, Terraform, Bash, Python).
  • Excellent troubleshooting and problem-solving skills.
  • Strong communication and collaboration skills.
  • Strong security and compliance understanding.
  • Experience working in a highly regulated environment
  • Experience in a fast-paced, high-growth company
  • Deploy, configure, and manage AI-powered employee productivity tools and in-house AI built solutions
  • Ensure high availability, reliability, and optimal performance of AI platforms and services. Implement monitoring, alerting, and incident response procedures
  • Design and implement scalable infrastructure to support the growing demands of AI tools and user base. Optimize resource utilization and manage capacity planning
  • Develop and maintain automation scripts and tools to streamline deployment, monitoring, and maintenance tasks. Contribute to the experimental sandbox environments for testing new AI solutions
  • Collaborate with cross-functional teams (Machine-Learning, HR, Security, Data Science, Developer Experience) to support the development and integration of AI solutions. Provide technical support and troubleshooting for AI-related issues
  • Adhere to security and privacy policies while deploying and managing AI tools. Ensure compliance with regulatory requirements
  • Implement comprehensive monitoring and metrics to track the performance and health of AI systems. Analyze data to identify areas for improvement and optimization
  • Participate in incident response and troubleshooting for AI-related outages or performance issues. Develop and maintain incident response plans
  • Contribute to backend development tasks to support the integration and functionality of AI tools
  • Deploy and manage AI solutions on public cloud platforms (AWS/GCP), leveraging cloud-native services and best practices
  • Excellent communication skills and experience presenting technical information to non-technical audiences, including senior leadership

AWS, Backend Development, Docker, Python, Bash, Cloud Computing, GCP, Java, Kubernetes, Go, CI/CD, RESTful APIs, Terraform, Ansible, Scripting

Posted 9 days ago

📍 United States

🧭 Full-Time

💸 161,000 - 180,000 USD per year

🔍 Adult entertainment

🏢 Company: Multi Media LLC

  • STEM degree and/or relevant experience as a Site Reliability Engineer, DevOps Engineer, or SWE
  • Proficiency in Python or Golang. Experience with another compiled or high-level language (C, C++, Java, Rust, etc.) is also accepted
  • Experience running Web applications at scale
  • Experience with Web application concepts and frameworks: ORM, MVC architecture, Django, Flask, Laravel, etc
  • Proficiency with Linux administration, Bash shell, and strong knowledge of Linux internals (e.g., filesystems, system calls)
  • Strong networking knowledge (e.g., routing, switching, TCP stack) for both metal and cloud (VPC, Security Groups) environments
  • Experience in database administration and configuration
  • Experience with DevOps tools such as Terraform, Ansible, Docker, and Kubernetes
  • Willingness to participate in on-call rotation and respond to monitoring and alerting of core website functions as needed
  • Analyze system performance using APM and distributed telemetry data to identify sources of instability
  • Improve scalability, reliability, and performance through software enhancements and patching
  • Develop tools and automation to streamline the DevOps pipeline
  • Design and manage infrastructure in both data center metal environments and in the public cloud
  • Conduct predictive failure analysis and disaster planning
  • Administer and configure databases and key-value stores with a focus on uptime and performance
  • Analyze complex systems to identify operational surprises and minimize downtime
  • Participate in incident response and produce postmortem reports
  • Collaborate with other engineering teams

AWS, Docker, PostgreSQL, Python, Bash, Kubernetes, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Networking, Ansible, Debugging

Posted 14 days ago

📍 United States, Canada

🧭 Full-Time

💸 180,000 - 185,000 USD per year

🔍 Financial Services

🏢 Company: Reach Financial | 👥 51-100 | Financial Services, Banking, Payments

  • 5+ years of experience in Software Engineering or Site Reliability Engineering, focusing on scalability, automation, and reliability.
  • Strong coding skills in at least one language (Python, JavaScript/TypeScript, Go, or similar), with experience writing production-quality software.
  • Proficiency with CI/CD tools (GitHub Actions, ArgoCD, Jenkins, or similar), integrating security and testing into deployment pipelines.
  • Experience with containerization (Docker, OCI) and orchestration tools such as Kubernetes or AWS ECS.
  • Deep understanding of observability and monitoring using OpenTelemetry, Datadog, Prometheus, Grafana, or similar.
  • Hands-on experience with serverless architectures (AWS Lambda, Step Functions) and event-driven systems (SNS, SQS, Kafka).
  • Familiarity with Infrastructure as Code (IaC) (Terraform, OpenTofu, CloudFormation) and cloud-native architecture.
  • A collaborative mindset and excellent communication skills, working closely with Software Engineers, PDEs, and Platform teams to drive system reliability and performance.
  • Develop software that improves system reliability, scalability, and performance.
  • Collaborate with Software Engineers, SDETs, and Platform Teams to design highly available, fault-tolerant systems.
  • Build self-healing and auto-scaling systems to minimize manual intervention and eliminate toil.
  • Enhance observability by developing custom monitoring, logging, tracing, and alerting solutions.
  • Write production-quality code in Python, JavaScript/TypeScript, or other modern languages.
  • Design and enforce SLIs, SLOs, and error budgets in partnership with Product and Engineering teams.
  • Troubleshoot and resolve complex incidents, applying root cause analysis (RCA) and postmortem processes.
  • Optimize cloud infrastructure (AWS, Kubernetes, Lambda, EC2) for cost, performance, and availability.
  • Partner with Security teams to ensure compliance and best practices are integrated across all systems and processes.

AWS, Docker, Python, SQL, Cloud Computing, JavaScript, Kafka, Kubernetes, MySQL, TypeScript, Go, Grafana, Prometheus, REST API, Serverless, Communication Skills, CI/CD, Problem Solving, Linux, DevOps, Terraform, Microservices, Excellent communication skills, JSON, Ansible, Financial analysis, Software Engineering

Posted 15 days ago

📍 United States, Canada

🧭 Full-Time

💸 170,000 - 210,000 USD per year

🔍 Software Development

  • 5+ years running production workloads on AWS (or GCP/Azure) with infrastructure-as-code (Terraform/CDK/CloudFormation)
  • Hands-on experience operating container orchestration (ECS, EKS, Kubernetes, Nomad, etc.) and designing blue/green or canary rollouts
  • Depth in at least two of our core datastores (Postgres, MongoDB, Kafka) including backup/restore, upgrades, and performance tuning
  • Fluency with CI/CD pipelines (we use Buildkite + GitHub Actions) and a knack for automating everything with shell, Python, or TypeScript
  • Proven track record setting up monitoring/alerting in Datadog, Prometheus, or similar, with clear SLO/SLA ownership
  • Strong grasp of Linux networking, load balancing (Cloudflare/ELB), and CDN/edge-security concepts
  • Excellent incident-management and root-cause analysis skills; able to write crisp RCAs and follow through on action items
  • Passion for customer-centric thinking, rapid iteration, and continuous learning
  • Set SLOs/SLIs, build self-healing architectures, and drive incident-prevention projects that keep our APIs and real-time ordering flows <100 ms p95.
  • Level-up dashboards, alerts, and distributed tracing so teams can detect issues before customers do.
  • Evolve our Buildkite pipelines and Terraform modules to give engineers <10-minute, one-click rollouts (and clean rollbacks).
  • Harden infra with least-privilege IAM, threat-model topology changes, and guide SOC 2 / PCI efforts.
  • Tune Postgres for multi-TB workloads, maintain Mongo sharding, and shepherd Kafka topic management as event volume climbs.
  • Rotate with the on-call SREs, run blameless post-mortems, and convert findings into durable fixes.
  • Pair with product engineers on capacity reviews, guide junior devs on Docker best practices, and evangelize “you build it, you run it.”

AWS, Docker, Node.js, PostgreSQL, Python, Bash, Kafka, MongoDB, React Native, TypeScript, Vue.Js, Nest.js, React, CI/CD, Linux, DevOps, Terraform, JSON

Posted 21 days ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: Jobgether | 👥 11-50 | 💰 $1,493,585 Seed about 2 years ago | Internet

  • Minimum of 5 years of experience in SRE, DevOps, or Infrastructure Engineering, demonstrating strong ownership and problem-solving skills.
  • Proficiency in Kubernetes, Helm, and networking security practices.
  • In-depth experience with AWS services such as RDS, Aurora, VPC, EKS, EC2, and IAM.
  • Expertise in PostgreSQL administration, including performance tuning and high availability management within AWS.
  • Familiarity with CI/CD tools like GitHub Actions and ArgoCD, with a focus on automation and security best practices.
  • Strong understanding and experience in Infrastructure as Code (IaC) using Crossplane and Terraform.
  • Experience in observability and monitoring with Datadog.
  • Proficiency in Python and Bash scripting for system automation and management.
  • Strong communication skills and the ability to collaborate effectively across engineering teams and document processes in Confluence.
  • Own initiatives related to system reliability and scalability, identifying potential issues and implementing proactive solutions to prevent them.
  • Participate in on-call rotations, responding to incidents, performing root cause analysis, and driving long-term fixes.
  • Design, deploy, and manage Kubernetes clusters, utilizing tools like Helm charts, Cilium, and Karpenter to optimize both performance and cost.
  • Architect and maintain AWS infrastructure, focusing on RDS/Aurora PostgreSQL, networking, and scaling best practices.
  • Automate infrastructure provisioning using tools like Crossplane and Terraform to maintain consistency and scalability.
  • Enhance observability by improving monitoring systems using Datadog and drive proactive detection and resolution of system issues.
  • Conduct post-incident reviews and document lessons learned, driving improvements into long-term system practices.

AWS, PostgreSQL, Python, Amazon RDS, AWS EKS, Bash, Kubernetes, CI/CD, Terraform, Networking, Scripting, Confluence

Posted about 1 month ago

📍 North America

🧭 Full-Time

💸 118,000 - 231,000 USD per year

🔍 Software Development

🏢 Company: MongoDB | 👥 1001-5000 | 💰 Post-IPO Equity about 7 years ago | Database, Open Source, Cloud Computing, SaaS, Software

  • 6+ years of experience building and operating developer-facing infrastructure or platform tooling
  • Comfortable working with Terraform, and have experience managing Terraform workflows in a team or multi-environment setup
  • Hands-on experience with GitHub Actions, including writing reusable and maintainable workflows
  • Experience with Bazel (or similar build systems like Buck, Buck2 or Pants) and care about build performance, caching, and reproducibility
  • Experience with cloud infrastructure (AWS preferred) and familiarity with concepts like IAM, VPCs, OpenID Connect and CI/CD
  • Take ownership of core components of our Terraform and Bazel workflows
  • Propose and implement improvements to our infrastructure CI/CD and build pipelines
  • Contribute to a roadmap that makes it easier for teams to safely onboard infrastructure changes and scale workflows
  • Participate in our on-call rotation, focused on infrastructure tooling reliability supporting production services

AWS, Cloud Computing, Kubernetes, CI/CD, Linux, DevOps, Terraform

Posted about 1 month ago

📍 OR, WA, CA, CO, TX, IL

🧭 Full-Time

💸 130,000 - 140,000 USD per year

🔍 Software Development

🏢 Company: Discogs | 👥 51-100 | 💰 $2,500,000 over 7 years ago | Database, Communities, Music

  • 5+ years of experience working with Kafka and relational database management systems (RDBMS)
  • Relational database schema design, query performance optimization, administration (MySQL, Percona Server, AWS RDS)
  • Kafka: Cluster administration (Strimzi), Kafka Connect (Debezium, JDBC)
  • CI/CD (GitHub Actions)
  • GitOps (ArgoCD)
  • Kubernetes (EKS, Kustomize, Karpenter, administration, application manifests)
  • AWS and cloud development (VPC, EKS, RDS, S3)
  • Observability (Datadog, Sentry)
  • Scripting (Shell, Python)
  • Stewarding Discogs’ data stores as a key subject matter expert
  • Leading efforts on the reliability and design patterns of our Kafka and Kafka Connect implementations
  • Establishing data contracts and clear communication standards between CDC producers and consumers
  • Working closely with engineering squads to refactor and re-architect MySQL database schema and indexing for long-term scalability, performance, and cost effectiveness
  • Mentoring engineering squads on Platform best practices for MySQL, Kafka, and other software development lifecycle areas
  • Writing documentation and runbooks that contribute to the engineering organization’s knowledge base
  • Working in a containerized, orchestrated environment
  • Contributing to the Platform team’s disciplines of site reliability and operations, supporting both our squads and Platform’s central infrastructure
  • Participating in on-call rotation, responding to incidents, and troubleshooting data and other operations issues

AWS, Python, ElasticSearch, Git, Kafka, Kubernetes, MySQL, FastAPI, RDBMS, REST API, CI/CD, Mentoring, Terraform, Documentation, Troubleshooting, Scripting

Posted about 1 month ago
🔥 Senior Site Reliability Engineer
Posted about 2 months ago

📍 United States

🧭 Full-Time

💸 120,000 - 125,000 USD per year

🔍 Software Development

  • Extensive expertise building and deploying web apps in AWS, Azure, and GCP
  • Networking
  • Distributed systems
  • Public cloud and container security (RBAC, process isolation, network security, firewalls, certificate management, etc.)
  • Reliability engineering (disaster resilience, multi-zonal deployments, logging practices, SLOs/SLIs, monitoring, deployment strategy, etc.)
  • Kubernetes
  • Docker/containers
  • Terraform
  • Python
  • Version control systems (we use Git/GitHub)
  • Linux
  • DevOps concepts and best practices
  • Authentication technologies such as OIDC, SAML
  • Participate in the development of CiviForm products as a service, building upon our existing deployment system and building out a new Kubernetes-based prototype to ensure robust, secure, and scalable production instances.
  • Manage staging and production environments, being on call to address outages
  • Work with governments on issues related to the service
  • Own and evolve the deployment systems
  • Participate in the development of a new CiviForm SaaS (Software as a Service).
  • Own development of this deployment system utilizing Kubernetes from prototyping through to delivery.
  • CiviForm’s existing infrastructure is defined with Python and Terraform and deployed into AWS and Azure. Improve the flexibility and features of the system to meet the needs of governments deploying CiviForm to their own cloud providers.
  • Define, implement, gather, and analyze metrics from deployments to identify areas for improvement related to cloud configuration
  • Partner with the engineering team to improve services through rigorous testing and release procedures, as well as resolving scaling issues and improving resilience
  • Draft Service Level Objectives and define Service Level Indicators, and implement them
  • Develop playbooks for deployments, including implementing a strategy for monitoring and alerting and how to address issues
  • Identify and mitigate security risks in deployments
  • Contribute to CI/CD implementation and best practices

AWS, Docker, Python, GCP, Kubernetes, Azure, CI/CD, Linux, DevOps, Terraform, Microservices, Networking


📍 UK, Americas, EU

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Auros | 👥 11-50 | 💰 $17,000,000 about 2 years ago | Cryptocurrency

  • An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
  • Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
  • A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
  • Knowledge and experience in managing configuration at scale.
  • Experience with CI/CD pipelines and version control best practices.
  • Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
  • Strong knowledge of cloud security and IAM policies is required.
  • SIEM and threat management experience.
  • Understanding of network security and application of zero-trust principles.
  • Must know how to secure Mac and Linux endpoints.
  • Software supply chain security knowledge and experience is desired, along with an understanding of vulnerability and secret scanning within CI/CD pipelines.
  • Participate in on-call roster to support our trading operations.
  • Maintain and improve our global infrastructure with high performance and reliability requirements.
  • Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
  • Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
  • Development of internal tools and automation to accomplish the team’s goals.
  • Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
  • Active participation in various trading and infrastructure projects.
  • Work closely with developers, traders and other staff to accomplish our firm’s goals.

AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

Posted 2 months ago