Site Reliability Engineer

Posted 2024-11-26

💎 Seniority level: Senior, 3+ years

📍 Location: USA

💸 Salary: 127500 - 153000 USD per year

🔍 Industry: IT and Security

🏢 Company: Automox | 👥 101-250 | 💰 $110.0m Series C on 2021-04-27 | 🫂 on 2022-04-13 | Cloud Management, Software, Cloud Infrastructure

🗣️ Languages: English

⏳ Experience: 3+ years

🪄 Skills: AWS, Docker, Python, Jenkins, Kafka, Kubernetes, Ruby, Go, Collaboration, Terraform

Requirements:
  • Demonstrated track record of maintaining and building large scale systems.
  • 3+ years SRE, Production or Systems Engineering experience.
  • Experience with AWS products, including Aurora Database and App Mesh.
  • Strong in at least one of these languages: Go, Elixir, Python, or Ruby.
  • Proficient in scripting and deployment automation using Jenkins, GitHub Actions, or ArgoCD.
  • Strong focus on security using Datadog, Cloudflare, Kyverno, and Vault.
  • Experience with systems such as Kubernetes (EKS & kOps), Docker, and Kafka.
  • Experience managing public cloud infrastructure.
Responsibilities:
  • Support the growth of feature teams by embedding with engineering teams and providing design expertise.
  • Advocate for best practices and influence decision-making from design phase to production.
  • Serve as the Subject Matter Expert (SME) for Kubernetes and enable developers to deliver reliable applications.
  • Cultivate a positive team culture and contribute to a collaborative environment.

Related Jobs

📍 United States

🧭 Full-Time

💸 140000 - 160000 USD per year

🔍 Cybersecurity

🏢 Company: Bitwarden | 👥 101-250 | 💰 $100.0m Series B on 2022-09-06 | Privacy, Cyber Security, Enterprise Software, Identity Management, Software

  • Sense of curiosity, resourcefulness, and pragmatism.
  • Expertise in multi-region deployments in public cloud environments.
  • Demonstrable production Kubernetes experience (Managed Kubernetes, Helm, kubectl, kOps, etc.).
  • Strong background in Reliability Engineering, DevOps, Software Engineering.
  • Fluency in at least one programming language, such as C#, Python, or Go.
  • Experience with cloud deployment and automation tools/methodologies like GitOps, Terraform, or Pulumi.
  • Proficiency in using source control such as Git.
  • Ability to maintain discretion and improve security best practices.
  • Interest in new technologies and trends.
  • Collaborative and adaptable mindset with excellent communication skills.
  • Passion for open source and internet security.
  • Excellent problem-solving skills.

  • Take ownership of the Bitwarden cloud infrastructure, focusing on user satisfaction.
  • Evaluate current infrastructure regularly and make recommendations for reliability, security, and cost management.
  • Implement site reliability tools, monitoring, and observability across cloud environments.
  • Respond to infrastructure outages and contribute to 24x7 support strategy.
  • Engage in architectural designs and engineering operations at scale.
  • Participate in code reviews and share knowledge.
  • Contribute to incident management processes.
  • Collaborate with cross-functional teams on priorities and deliverables.
  • Align SLIs/SLOs/SLAs with product owners.
  • Identify new initiatives for organizational needs.
  • Influence Bitwarden’s SDLC as it scales.
  • Mentor team members.

🪄 Skills: Python, Git, Kubernetes, C#, Strategy, Go, Communication Skills, DevOps, Terraform

Posted 2024-12-03

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

💸 109047 - 169455 USD per year

🔍 Nonprofit Organization, Technology

🏢 Company: Wikimedia Foundation

  • At least two years of experience in an SRE/Operations/DevOps role as part of a team.
  • Experience supporting high availability distributed production systems.
  • Experience with database administration and support.
  • Knowledge of configuration management and orchestration tools (e.g., Puppet, Ansible).
  • Familiarity with observability infrastructure (monitoring, metrics, logging).
  • Proficient in shell and scripting languages (e.g., Python, Go, Bash, Ruby).
  • Understanding of Linux/Unix fundamentals and debugging skills.
  • Excellent written and verbal communication skills.
  • BS or MS degree in Computer Science or equivalent work experience.

  • Deployment, configuration, and maintenance of distributed data systems for the data and analytics platform.
  • Implement data quality monitoring to alert the team of possible data issues.
  • Collaborate with Fundraising to integrate data from various self-hosted and third-party sources.
  • Provide engineering support during high-traffic campaigns.
  • Document internal systems and processes.
  • Ensure compliance with relevant regulations, such as Donor Privacy Policy, GDPR, and PCI DSS.
  • Manage users and permissions for data access control.
  • Advise on best practices for data input and streamline processes.

🪄 Skills: Python, Bash, Ruby, Data engineering, Go, Communication Skills, Collaboration, Linux, DevOps, Documentation, Compliance

Posted 2024-12-03

📍 US and Canada

🧭 Full-Time

💸 150000 - 200000 USD per year

🔍 Healthcare

🏢 Company: Synthesis Health

  • Bachelor's degree or Diploma in computer science, engineering, mathematics, or related field.
  • At least one year of experience as a Python developer transitioning to an SRE role.
  • Five years of experience in software development as a DevOps and/or SRE.
  • Two years of experience in an SRE role with Kubernetes, preferably GKE.
  • Experience using ArgoCD for rollouts and deployments.
  • One year of experience with a service mesh such as Istio in a GKE environment.
  • Proficiency in scripting languages like Python and automation tools like Terraform.
  • Solid understanding of security best practices for pipelines and cloud environments.
  • Familiarity with compliance standards like SOC 2, HIPAA.
  • Strong expertise in CI/CD pipeline management.

  • Design and implement automated application deployment processes.
  • Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
  • Manage development, testing, staging, pre-production and production environments.
  • Automate repetitive deployment tasks to improve productivity.
  • Select, develop, and monitor CI/CD systems.
  • Oversee software automation across GCP.
  • Containerize services to optimize resources and deployment speed.
  • Manage and optimize cloud infrastructure for cost and performance.
  • Ensure compliance with security standards and maintain disaster recovery plans.
  • Collaborate with cross-functional teams to improve software delivery.

🪄 Skills: Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer service, DevOps, Terraform, Organizational skills, Documentation, Compliance

Posted 2024-11-24

📍 United States

🔍 Operations management platform for space exploration

  • Years of experience using Kubernetes (K8s).
  • Proficient in JavaScript.

  • Responsible for building complex infrastructure and deployment scenarios.
  • Support the maintenance and reliability of systems and applications.

🪄 Skills: AWS, Docker, JavaScript, Kubernetes, React.js, Postgres

Posted 2024-11-23

📍 United States

🔍 Broadcast Automation

🏢 Company: ARFA Solutions, LLC

  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 3+ years of experience in a Site Reliability Engineer role or similar position.
  • Strong knowledge of broadcast automation systems and workflows.
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Proficiency in scripting languages such as Python or Bash.
  • Hands-on experience with cloud platforms (AWS, Azure, or GCP) and their services.
  • Familiarity with containerization technologies such as Docker and orchestration platforms like Kubernetes.
  • Solid understanding of configuration management tools (e.g., Ansible, Terraform).
  • Excellent problem-solving skills and a proactive approach to managing incidents.
  • Strong communication skills and the ability to collaborate with cross-functional teams.

  • Monitor broadcast automation systems to ensure high availability and performance.
  • Implement and manage automated deployment processes for broadcasting applications.
  • Troubleshoot and resolve incidents impacting broadcast automation services promptly and effectively.
  • Work closely with the development team to design and implement scalable solutions that meet reliability standards.
  • Participate in on-call rotations to provide support for critical incidents.
  • Develop and maintain documentation for system architecture, operational procedures, and incident reports.
  • Continuously assess and improve existing processes for reliability and efficiency.
  • Stay up-to-date with industry trends and technologies related to site reliability and broadcast automation.

🪄 Skills: AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Communication Skills, Terraform, Documentation

Posted 2024-11-22

📍 United States

🧭 Full-Time

🔍 Cannabis industry

🏢 Company: Weedmaps

  • 10+ years in Site Reliability/DevOps or Software Engineering, including 3+ years of SaaS architectural experience.
  • Broad knowledge of distributed systems, performance testing, and observability patterns.
  • Proven ability to present effectively to management and cross-functional teams.
  • Experience with scalable architecture, automation of operational toil, and clean coding practices.

  • Collaborate with leadership to design and build solutions across the Weedmaps ecosystem.
  • Work in a service-oriented architecture across multiple domains.
  • Drive critical initiatives and support application lifecycle processes.
  • Mentor and train engineers, influencing the technical direction.
  • Act as a subject matter expert in specific technologies and domains.

🪄 Skills: AWS, Docker, Leadership, Kubernetes, Communication Skills, CI/CD, DevOps, Terraform, Documentation, Microservices, Compliance

Posted 2024-11-16

📍 United States

🧭 Full-Time

💸 204000 - 281000 USD per year

🔍 Cybersecurity

🏢 Company: SentinelOne

  • Extensive SRE Experience: Proven experience in architecting and implementing SRE solutions at scale within a microservices or distributed systems environment.
  • 15+ years of progressive professional experience, with 5+ years of recent experience supporting enterprise SaaS environments.
  • Technical Expertise: Deep knowledge of incident management, alert correlation, automated triage, and SLO frameworks.
  • Proficiency in one or more programming languages (e.g., Python, Go, Java) with experience in automation and scripting.
  • Experience with machine learning and data analytics for real-time alert systems.
  • Expertise in cloud platforms (e.g., AWS, GCP, Azure) and container orchestration (e.g., Kubernetes).
  • Ability to make critical architectural decisions focused on business impact and system performance.

  • Design and guide the implementation of end-to-end alert correlation, auto-triage, and auto-remediation frameworks for a microservices SaaS architecture.
  • Ensure solutions align with business priorities and customer impact goals.
  • Define, implement, and monitor SLOs in collaboration with product and engineering teams.
  • Establish reliability standards to drive accountability around service performance.
  • Partner with software engineers, SREs, and data scientists to implement monitoring, alerting, and SLO solutions.
  • Lead initiatives promoting best practices across SentinelOne engineering.
  • Mentor engineers and contribute to a culture of reliability engineering excellence.

🪄 Skills: AWS, Leadership, Python, Data Analysis, GCP, Java, Kubernetes, Machine Learning, Azure, Go, Collaboration, Terraform, Microservices

Posted 2024-11-15

📍 United States

💸 192000 - 288000 USD per year

🔍 Frontend Cloud and web services

🏢 Company: Vercel

  • At least 3 years of experience in an SRE role, or at least 5 years of experience in an adjacent role (e.g., platform engineering), operating in a scaled environment.
  • Firm grasp of the SRE philosophy and mindset, with practical experience working on or directly with SRE teams that have proactively engaged in system design and improvement.
  • Strong sense of accountability and commitment to problem-solving, backed by curiosity to dig deep and identify root causes.
  • Willingness to proactively engage with development teams to influence the course of software design and operational practices.
  • Capability to manage risk, make decisions, and exhibit sound judgment.
  • Demonstrated ability to plan and deliver long-term projects.
  • Familiarity with networking protocols and application serving.
  • Experience deploying and operating systems on AWS infrastructure at scale.
  • Bonus: Experience working with Terraform, Kubernetes, Golang, and/or Lua.

  • Ensure that our products are built for reliability and scale by engaging in the end-to-end design, development, and deployment of new software.
  • Drive continuous risk mitigation and reduction through direct involvement in incident management, blameless postmortems, and follow-ups.
  • Drive measurable improvements to the reliability, performance, and efficiency of our production systems through instrumentation, analysis, and implementation of engineering improvements.
  • Devise repeatable, low-toil operational practices through the development of automated systems for software delivery, system failover, and capacity management.

🪄 Skills: AWS, Problem Solving

Posted 2024-11-13

📍 CA, CO, CT, FL, GA, HI, IL, IN, IA, MD, MA, MI, MO, NJ, NM, NY, NC, OH, PA, TN, TX, UT, VA, WA

🧭 Full-Time

💸 135520 - 178060 USD per year

🔍 Non-profit mental health support

🏢 Company: Crisis Text Line

  • Bachelor's degree in Computer Science, Engineering, or related field; Master’s preferred.
  • Proven experience as a Staff SRE or in a similar role.
  • Maintaining reliability of online SaaS/PaaS.
  • Proficiency in AWS and infrastructure as code (Terraform, CloudFormation).
  • Strong scripting skills (Python) and knowledge of containerization (Docker, Kubernetes).
  • Experience in CI/CD pipelines and observability tools (GitHub Actions, Datadog).
  • Understanding of network protocols and security principles.

  • Helping to lead and mentor a team of 5 SREs.
  • Designing, implementing, and maintaining AWS infrastructure.
  • Collaborating with developers for performance optimization.
  • Developing monitoring, logging, and alerting systems.
  • Automating repetitive tasks to improve efficiency.
  • Responding to incidents to minimize downtime.
  • Supporting diversity on the engineering team.
  • Communicating expectations and progress clearly.
  • Providing mentorship and promoting technical best practices.
  • Participating in retrospectives to improve processes.
  • Conducting regular security audits.

🪄 Skills: AWS, Docker, GraphQL, PHP, Python, GCP, Kubernetes, Azure, Data Structures, Go, Next.js, Communication Skills, Collaboration, CI/CD, DevOps, Terraform, Compliance

Posted 2024-11-09

📍 USA

🧭 Full-Time

🔍 Cryptocurrency

🏢 Company: Referrals Only Board

  • 5+ years of software engineering experience.
  • Strong understanding of data structures and algorithms related to performance and reliability.
  • Fluency in at least one programming language such as Golang, Ruby, Python, or JavaScript.
  • Strong skills around observability, debugging, and performance tuning.
  • Ability to debug complex systems and willingness to understand and improve any layer of the stack.
  • Experience with container orchestration systems (Docker, ECS, EKS) and monitoring tools (Datadog, Graphite, Grafana, Prometheus).
  • Deep knowledge of UNIX/Linux system internals including system calls, TCP/IP, and debugging tools.
  • Strong communication skills and ability to explain technical concepts clearly.
  • Demonstrated critical thinking under pressure.

  • Build automation and improve systems to eliminate toil and operations work.
  • Improve observability, reliability, and availability by defining and measuring key metrics.
  • Collaborate with the core infrastructure team to performance tune and optimize cloud deployments.
  • Collaborate with product teams to reduce service disruptions and automate incident response.
  • Proactively find and analyze reliability problems and design software for improvements.
  • Facilitate incident response, conduct root cause analysis, and blameless retrospectives.
  • Educate and mentor the engineering team to enhance system reliability and promote reliability as a core value.

🪄 Skills: Docker, Python, Blockchain, Ethereum, JavaScript, Kubernetes, Ruby, Algorithms, Data Structures, Golang, Communication Skills, Linux, Terraform

Posted 2024-11-07