Senior Site Reliability Engineer

Posted 2024-12-03

💎 Seniority level: Senior

📍 Location: United States

💸 Salary: 140000 - 160000 USD per year

🔍 Industry: Cybersecurity

🏢 Company: Bitwarden 👥 101-250 💰 $100.0m Series B on 2022-09-06 (Privacy, Cyber Security, Enterprise Software, Identity Management, Software)

🗣️ Languages: English

🪄 Skills: Python, Git, Kubernetes, C#, Strategy, Go, Communication Skills, DevOps, Terraform

Requirements:
  • Sense of curiosity, resourcefulness, and pragmatism.
  • Expertise in multi-region deployments in public cloud environments.
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.).
  • Strong background in Reliability Engineering, DevOps, Software Engineering.
  • Fluency in at least one programming language, such as C#, Python, or Go.
  • Experience with cloud deployment and automation tools/methodologies like GitOps, Terraform, or Pulumi.
  • Proficiency in using source control such as Git.
  • Ability to maintain discretion and improve security best practices.
  • Interest in new technologies and trends.
  • Collaborative and adaptable mindset with excellent communication skills.
  • Passion for open source and internet security.
  • Excellent problem-solving skills.
Responsibilities:
  • Take ownership of the Bitwarden cloud infrastructure, focusing on user satisfaction.
  • Evaluate current infrastructure regularly and make recommendations for reliability, security, and cost management.
  • Implement site reliability tools, monitoring, and observability across cloud environments.
  • Respond to infrastructure outages and contribute to 24x7 support strategy.
  • Engage in architectural designs and engineering operations at scale.
  • Participate in code reviews and share knowledge.
  • Contribute to incident management processes.
  • Collaborate with cross-functional teams on priorities and deliverables.
  • Align SLIs, SLOs, and SLAs with product owners.
  • Identify new initiatives for organizational needs.
  • Influence Bitwarden’s SDLC as it scales.
  • Mentor team members.

Related Jobs

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

💸 109047 - 169455 USD per year

🔍 Nonprofit Organization, Technology

🏢 Company: Wikimedia Foundation 👥 251-500 💰 $2.1m Grant on 2020-01-01

Requirements:
  • At least two years of experience in an SRE/Operations/DevOps role as part of a team.
  • Experience supporting high availability distributed production systems.
  • Experience with database administration and support.
  • Knowledge of configuration management and orchestration tools (e.g., Puppet, Ansible).
  • Familiarity with observability infrastructure (monitoring, metrics, logging).
  • Proficient in shell and scripting languages (e.g., Python, Go, Bash, Ruby).
  • Understanding of Linux/Unix fundamentals and debugging skills.
  • Excellent written and verbal communication skills.
  • BS or MS degree in Computer Science or equivalent work experience.

Responsibilities:
  • Deployment, configuration, and maintenance of distributed data systems for the data and analytics platform.
  • Implement data quality monitoring to alert the team of possible data issues.
  • Collaborate with Fundraising to integrate data from various self-hosted and third-party sources.
  • Provide engineering support during high-traffic campaigns.
  • Document internal systems and processes.
  • Ensure compliance with relevant regulations, such as Donor Privacy Policy, GDPR, and PCI DSS.
  • Manage users and permissions for data access control.
  • Advise on best practices for data input and streamline processes.

🪄 Skills: Python, Bash, Ruby, Data engineering, Go, Communication Skills, Collaboration, Linux, DevOps, Documentation, Compliance

Posted 2024-12-03

📍 US and Canada

🧭 Full-Time

💸 150000 - 200000 USD per year

🔍 Healthcare

🏢 Company: Synthesis Health

Requirements:
  • Bachelor's degree or Diploma in computer science, engineering, mathematics, or related field.
  • At least one year of experience as a Python developer transitioning to an SRE role.
  • Five years of experience in software development as a DevOps and/or SRE.
  • Two years of experience in an SRE role with Kubernetes, preferably GKE.
  • Experience using ArgoCD for rollouts and deployments.
  • One year of experience with a service mesh such as Istio in a GKE environment.
  • Proficiency in scripting languages like Python and automation tools like Terraform.
  • Solid understanding of security best practices for pipelines and cloud environments.
  • Familiarity with compliance standards like SOC 2, HIPAA.
  • Strong expertise in CI/CD pipeline management.

Responsibilities:
  • Design and implement automated application deployment processes.
  • Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
  • Manage development, testing, staging, pre-production and production environments.
  • Automate repetitive deployment tasks to improve productivity.
  • Select, develop, and monitor CI/CD systems.
  • Oversee software automation across GCP.
  • Containerize services to optimize resources and deployment speed.
  • Manage and optimize cloud infrastructure for cost and performance.
  • Ensure compliance with security standards and maintain disaster recovery plans.
  • Collaborate with cross-functional teams to improve software delivery.

🪄 Skills: Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer service, DevOps, Terraform, Organizational skills, Documentation, Compliance

Posted 2024-11-24

📍 USA

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Referrals Only Board

Requirements:
  • 5+ years of software engineering experience.
  • Strong understanding of data structures and algorithms related to performance and reliability.
  • Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
  • Strong skills around observability, debugging, and performance tuning.
  • Ability to debug complex systems and willingness to understand and improve any layer of the stack.
  • Experience with container orchestration systems (Docker, ECS, EKS) and monitoring tools (DataDog, Graphite, Grafana, Prometheus).
  • Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
  • Strong communication skills and ability to explain technical concepts clearly.
  • Demonstrated critical thinking under pressure.

Responsibilities:
  • Build automation and improve systems to eliminate toil and operations work.
  • Improve observability, reliability, and availability by defining and measuring key metrics.
  • Collaborate with the core infrastructure team to performance tune and optimize cloud deployments.
  • Collaborate with product teams to reduce service disruptions and automate incident response.
  • Proactively find and analyze reliability problems and design software for improvements.
  • Facilitate incident response, conduct root cause analysis, and blameless retrospectives.
  • Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

🪄 Skills: Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Golang, Communication Skills, Linux, Terraform

Posted 2024-11-07

📍 United States

🧭 Full-Time

💸 150000 - 230000 USD per year

🔍 Public safety technology

🏢 Company: Axon

Requirements:
  • This position involves handling of classified federal data; under federal regulations, it is open to US Citizens only.
  • 10+ years of applicable experience.
  • Experience managing cloud platforms such as Azure, AWS, or similar.
  • Experience operating in Kubernetes platforms like AKS, EKS, or similar.
  • Experience using managed languages such as Python, Go, C#, Java, or similar.
  • Experience utilizing CI/CD platforms to automate provisioning infrastructure, software builds, tests, and releases.
  • Experience using observability tools such as APM, logging, and metrics to assist with debugging issues.
  • Experience designing tooling to simplify the operational management of SaaS/PaaS systems.
  • Familiarity with building flexible and testable Infrastructure as Code modules.
  • Empathy to support the needs of software engineers.

Responsibilities:
  • Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision services rapidly, consistently, and securely.
  • Exemplify cloud-native site reliability best practices.
  • Write code that is performant, maintainable, clear, and concise.
  • Employ strong problem-solving skills to debug problems in cloud-native distributed systems.
  • Influence and educate the engineering organization to adopt new and improved architectural patterns.
  • Provide robust documentation for use by engineers to promote self-service.
  • Take calculated risks, champion new ideas, and cultivate your craft.

🪄 Skills: AWS, Python, Java, Kubernetes, C#, Azure, Go, CI/CD

Posted 2024-11-07

📍 US, Portugal

🧭 Full-Time

🔍 Health Technology

Requirements:
  • Proficiency in programming languages such as Python, Go, or JavaScript.
  • 5+ years of experience with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong understanding of Linux/Unix systems and networking.
  • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Proficiency with relational and NoSQL databases (e.g., MySQL, PostgreSQL, Redis, Elasticsearch).
  • Willingness to collaborate and share knowledge with colleagues.
  • Ability to take responsibility for work and demonstrate accountability.

Responsibilities:
  • Develop and maintain monitoring and alerting solutions.
  • Respond to incidents, troubleshoot issues, and perform root cause analysis.
  • Automate repetitive tasks and improve deployment processes.
  • Develop and maintain tools to support infrastructure and applications.
  • Analyze system performance and implement optimizations to improve efficiency and reduce latency.
  • Ensure systems are secure and compliant with relevant standards and regulations.
  • Maintain comprehensive documentation of systems and processes.
  • Share knowledge and best practices with team members.
  • Ensure the reliability, performance, and scalability of databases.
  • Perform database optimization, maintenance, and troubleshooting.

🪄 Skills: AWS, Docker, PostgreSQL, Python, Elasticsearch, JavaScript, Jenkins, Kubernetes, MySQL, Azure, Go, Grafana, Prometheus, Redis, NoSQL, CI/CD

Posted 2024-11-07

📍 USA

💸 160000 - 195000 USD per year

🔍 Healthcare

🏢 Company: Garner Health 👥 51-100 💰 $45.0m Series B on 2021-12-14 (Business Intelligence, Big Data, Medical, Employee Benefits, Health Care)

Requirements:
  • 5+ years of experience delivering software solutions.
  • 4+ years of hands-on production work with cloud infrastructure, containers, monitoring, and alerting.
  • 3+ years working in a security-conscious environment.
  • Expertise and experience leading cloud-first/only projects, preferably on AWS.
  • Expertise with Terraform.
  • Experience with Kubernetes.
  • Experience with Go and Python, particularly utilizing Kubernetes APIs.

Responsibilities:
  • Architect, operate, improve, and secure the platform the Garner Health app runs on.
  • Boost developer productivity.
  • Build systems to a high engineering standard and ensure others adhere to these standards.
  • Research and advocate for improved techniques, processes, and designs.
  • Collaborate with teammates on strategic platform initiatives.
  • Support the Garner platform in production.
  • Ensure security in production according to regulatory requirements.
  • Partner with stakeholders to maintain product availability and performance.

🪄 Skills: AWS, Python, Kubernetes, Go, Terraform

Posted 2024-10-21

📍 United States

💸 $161,000 - $180,000 per year

🔍 Adult entertainment

🏢 Company: Multi Media LLC

Requirements:
  • STEM degree and relevant experience as a Site Reliability Engineer.
  • Exceptional problem-solving skills.
  • High proficiency in one of the following: C, C++, Java, Python, Go, etc.
  • High proficiency in a Unix/Linux environment, with excellent knowledge of internals (e.g., filesystems, system calls).
  • Networking knowledge (e.g., routing, switching, TCP stack) for both metal and cloud (VPC, Security Groups) environments.
  • Experience in database administration and configuration.
  • Experience with DevOps tools such as Terraform, Ansible, Docker, and Kubernetes.
  • On-call availability to respond to monitoring and alerting for core website functions, as needed.

Responsibilities:
  • Performance analysis to identify sources of instability using data from APM and distributed telemetry tools.
  • Analyze complex systems to identify operational surprises and minimize downtime.
  • Software engineering and patching to incrementally improve performance, scalability, and reliability.
  • Infrastructure modifications in both a data center metal environment with advanced routing/switching and in the public cloud.
  • Predictive failure analysis and disaster planning.
  • Author new tools and automation to streamline the DevOps pipeline.
  • Collaborate with other engineering teams.
  • Database and key-value store administration and configuration with a focus on uptime and performance.
  • Incident response and postmortem reports.

🪄 Skills: Docker, Python, Java, Kubernetes, Terraform

Posted 2024-10-05

📍 United States, Canada

🧭 Full-Time

💸 $139,000 - $218,000 per year

🔍 Web Development

Requirements:
  • Either a background as an ops engineer with an enthusiasm for code, or a background as a software engineer with an enthusiasm for systems administration.
  • 5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
  • Experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
  • Experience with container-centric architectures, built with Docker and tools like Kubernetes (EKS, GKE, AKS, OpenShift, etc.), ECS, Docker Swarm, or Mesos.
  • Experience with infrastructure-as-code tools like Terraform, Pulumi, Ansible, Puppet, or Chef.
  • Experience in contributing to full-stack applications built using tools like React, Node, and MongoDB.
  • Enthusiasm for mentoring and sponsoring less-experienced engineers.

Responsibilities:
  • Empower engineers on other teams to take control of their services by maintaining monitoring tooling and collaborating on internal best practices for observability.
  • Enhance reliability of applications running in Kubernetes by optimizing resource allocation, streamlining upgrade processes, and ensuring scalability and fault tolerance.
  • Occasionally dive into the main Webflow application in Node, Python, or Go to better discern (and sometimes fix) behavior in production.
  • Work with peers on Webflow’s Customer Support, Partnerships, and Sales teams to enable customers using Webflow’s services in production.
  • Participate in and continuously improve on-call and incident response processes.

🪄 Skills: AWS, Docker, Python, GCP, Kubernetes, MongoDB, Go, React

Posted 2024-09-19

📍 North America

🧭 Full-Time

🔍 Incident Management Platform

🏢 Company: Rootly

Requirements:
  • You have 5+ years of experience in an SRE or Infrastructure Engineering role.
  • You have 5+ years of experience writing software in a SWE or software-heavy SRE role.
  • You have strong technical knowledge of cloud infrastructure, distributed systems, and reliability practices.
  • You’ve supported web or RPC services at significant scale.
  • You have experience solving infrastructure problems by writing software.
  • You have a big-picture perspective on systems and tools.
  • You can collaborate with other Engineering teams to understand their systems and help to improve them.

Responsibilities:
  • Participate in an on-call rotation to support critical Rootly services, and in some cases be on call with software teams.
  • Participate in the definition and management of SLOs and error budgets for the Engineering teams that own services in production.
  • Build tools to support our processes.
  • Embed with feature delivery software teams to build and enhance observability, reliability, and availability of those services.
  • Work with other teams around Engineering to understand their systems and their challenges at the code level and identify improvements in Rootly Infrastructure to improve the services they own (contribute code where possible).

🪄 Skills: AWS, Backend Development, Software Development, Cloud Computing, Git, Kubernetes, CI/CD

Posted 2024-09-13

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 109047 - 169455 USD per year

🔍 Nonprofit / Technology

Requirements:
  • At least two years of experience in an SRE/Operations/DevOps role as part of a team.
  • Experience supporting high availability distributed production systems.
  • Experience with database administration and support.
  • Comfortable with configuration management and orchestration tools (e.g., Puppet, Ansible, Chef, SaltStack).
  • Knowledge of modern observability infrastructure (monitoring, metrics, and logging).
  • Proficient in shell and scripting languages such as Python, Go, Bash, Ruby.
  • Good understanding of Linux/Unix fundamentals and debugging skills.
  • Excellent written and verbal communication skills.
  • BS or MS degree in Computer Science or equivalent work experience.

Responsibilities:
  • Deployment, configuration, and maintenance of the distributed data systems that comprise our data and analytics platform.
  • Implement data quality monitoring that alerts the team of possible data issues.
  • Collaborate closely with the Fundraising team to integrate and use data from self-hosted and third-party sources.
  • Provide engineering support during high-traffic or critical campaigns.
  • Write and update internal documentation of systems and processes.
  • Ensure compliance with regulations like the Donor Privacy Policy, GDPR, and PCI DSS.
  • Create and manage users and permissions for data access control.
  • Advise on data input best practices and develop processes for data entry consistency.
  • Work closely with Fundraising Analytics to gather and prioritize data enhancement requests.

🪄 Skills: Python, Bash, Ruby, Data engineering, Go, Communication Skills, Collaboration, C (Programming language)

Posted 2024-08-22