Site Reliability Engineer

Posted 2 months ago

💎 Seniority level: Senior

📍 Location: USA

💸 Salary: 152000 - 175000 USD per year

🔍 Industry: AI and machine learning

🏢 Company: RunPod, Inc.

🗣️ Languages: English

🪄 Skills: Docker, Python, Go, Grafana, Communication Skills

Requirements:
  • Deep knowledge of Linux kernel internals, containerization (Docker), virtualization (Kata/QEMU), and networking components.
  • Extensive experience with distributed system troubleshooting and design.
  • Proficiency in at least one programming language, preferably Python or Golang.
  • Proven experience implementing and managing SLIs and SLOs.
  • Experience with pull-based configuration management tools such as Chef or Puppet.
  • Demonstrated ability to manage large-scale bare-metal fleets (5,000+ machines) across multiple data centers.
  • Strong background in implementing security best practices for foundational systems, including secret management, AWS IAM permissions, and key distribution systems.
  • Comprehensive understanding of OSI model Layers 3, 4, and 7.
  • Successful completion of a background check.
Responsibilities:
  • Design, implement, and maintain robust, scalable, and highly available systems.
  • Troubleshoot and resolve complex issues in distributed environments.
  • Develop and implement SLIs and SLOs to ensure system reliability and performance.
  • Manage and optimize large-scale bare-metal fleets across multiple data centers.
  • Implement and maintain secure practices for foundational systems.
  • Collaborate with cross-functional teams to improve system design and operation.
  • Automate processes to increase efficiency and reduce human error.
  • Participate in on-call rotations to provide 24/7 support for critical systems.

Related Jobs

📍 United States

🧭 Full-Time

🔍 Cybersecurity

🏢 Company: Keeper Security, Inc.

  • 5+ years of experience as a Site Reliability Engineer, DevOps Engineer, or similar role focused on infrastructure and automation.
  • Proficiency with cloud platforms (AWS, Azure, Google Cloud) and infrastructure management tools like Terraform, Kubernetes, or Docker.
  • Experience with CI/CD tools (Jenkins, GitHub Actions, or GitLab CI).
  • Strong understanding of monitoring tools (Prometheus, Grafana, New Relic) for system reliability.
  • Experience with Linux, Mac OS X, Windows, and knowledge of scripting languages (Python, Bash, Go).
  • In-depth knowledge of networking concepts, security best practices, and incident management.
  • Ability to communicate complex technical issues clearly within a cross-functional team.
  • Proactive problem-solver with a collaborative mindset.

  • Design, implement, and manage the infrastructure and tools for CI/CD and software deployment.
  • Ensure high availability and performance of production systems by monitoring critical services.
  • Manage infrastructure automation using tools like Terraform or Kubernetes.
  • Support security audits to meet regulatory standards.
  • Collaborate with teams to optimize build and release pipelines.
  • Troubleshoot issues related to system performance and reliability.
  • Stay updated with trends in infrastructure and DevOps practices.
  • Contribute to monitoring and alerting systems to identify reliability issues.
  • Promote a reliability culture by defining and achieving reliability goals.

🪄 Skills: AWS, Docker, Python, Bash, Jenkins, Kubernetes, Mac OS X, Go, Grafana, Prometheus, Linux, Terraform

Posted 3 days ago

📍 United States, United Kingdom, Spain, Italy, Canada

🔍 Interactive entertainment

🏢 Company: Escape Velocity Entertainment Inc

  • 5+ years of experience in a Site Reliability, DevOps, or Platform engineering role.
  • 5+ years of experience with observability, application monitoring, telemetry collection, and data visualization tools.
  • Experience with GitOps workflows and Helix Core / Perforce.
  • Experience in implementing and maintaining CI/CD systems.
  • Expertise in Infrastructure as Code design using Ansible, Terraform, and CloudFormation.
  • Experience with backend game engines like Pragma, GameLift, or Agones.
  • Experience with capacity planning and FinOps.
  • Proficiency in one or more high-level languages such as Python, Kotlin, JavaScript, or C++.
  • Strong Linux skills and understanding of public cloud services.

  • Analyze, implement, and improve complex systems responsible for delivering games to millions of fans.
  • Own the delivery, scalability, and reliability of the cloud-hosted game title.
  • Partner with game teams to advise and implement best practices.
  • Take ownership of projects from start to finish while maintaining quality.
  • Engage with product teams to diagnose and resolve operational issues.
  • Maintain relationships with internal and external partners.
  • Optimize reliability, availability, observability, and cost.
  • Participate in on-call rotations for critical incidents.

🪄 Skills: Python, JavaScript, Kotlin, C++, CI/CD, Linux, Terraform, Data visualization, Ansible

Posted 9 days ago

📍 United States

🧭 Full-Time

💸 Salary: 125000 - 135000 USD per year

🔍 Transaction and compliance software for state and local governments

🏢 Company: GovOS

  • At least 2 years of experience managing, troubleshooting, and optimizing Linux and Windows environments.
  • Demonstrated ability in various programming and scripting languages (e.g., Python, Bash, PowerShell).
  • Hands-on experience in designing, deploying, and maintaining cloud infrastructure in AWS or Azure.
  • In-depth knowledge of container technologies (Docker, Podman) and orchestration platforms (Kubernetes, ECS, AKS).
  • Strong expertise in version control systems (e.g., Git) and configuration management tools (e.g., Ansible, Terraform, Chef).
  • Experience in administering and optimizing databases (e.g., MySQL, PostgreSQL, MongoDB).

  • Enhance Developer Workflows: Design and implement productive developer workflows, focusing on automation.
  • Continuous Integration & Deployment: Build, optimize, and maintain CI/CD pipelines.
  • Environment Design: Collaborate on the design of production and non-production environments for scalability and reliability.
  • Data Security: Develop processes to protect customer data, aligning with best practices.
  • Incident Management: Participate in on-call rotations for system reliability.
  • Team Contribution: Support the team with other duties and initiatives.

🪄 Skills: AWS, Docker, PostgreSQL, Python, Bash, Git, Jenkins, Kubernetes, MongoDB, MySQL, Azure, Linux, Terraform, Ansible

Posted 14 days ago

📍 United States, Canada

🧭 Full-Time

💸 Salary: 108000 - 163900 USD per year

🔍 Active Insurance, Digital Risk Management

🏢 Company: Coalition, Inc.

  • 3+ years of experience in SRE/DevOps/Cloud engineering or Software Development roles in a full stack engineering environment.
  • Strong understanding of AWS services (e.g., EC2, S3, RDS, Lambda, VPC).
  • Hands-on experience with IaC tools like Terraform, CloudFormation, or CDK.
  • Experience with containerization and orchestration tools such as ECS and Kubernetes.
  • Experience working with fault-tolerant services and developing highly available systems.
  • Exposure to full-stack monitoring and CI/CD pipelines.
  • Some knowledge of software engineering design patterns, agile development, and architecture principles.
  • Strong analytical and problem-solving skills.

  • Play a pivotal role in ensuring the performance, availability, and efficiency of cloud-based systems.
  • Design, implement, and manage robust cloud solutions.
  • Automate infrastructure and build developer-friendly platforms.
  • Optimize cloud resources and improve system observability.
  • Drive operational excellence across the organization.
  • Participate in a low-volume on-call rotation to maintain system reliability.

🪄 Skills: AWS, Docker, Python, Kubernetes, Go, CI/CD, Terraform

Posted 17 days ago

📍 United States, Canada

💸 Salary: 108000 - 163900 USD per year

🔍 Insurance, Cybersecurity

  • 3+ years of experience in SRE/DevOps/Cloud engineering or Software Development roles.
  • Strong understanding of AWS services and best practices.
  • Hands-on experience with IaC tools like Terraform or CloudFormation.
  • Experience with containerization tools such as ECS or Kubernetes.
  • Exposure to full-stack monitoring and CI/CD pipelines.
  • Strong analytical and problem-solving skills.

  • Design, implement, and manage robust cloud solutions.
  • Work closely with cross-functional teams.
  • Isolate, trap, and respond to system failures.
  • Develop strategies for continuous monitoring and analysis.
  • Participate in a low-volume on-call rotation to maintain reliability.

🪄 Skills: AWS, Docker, Python, Java, Kafka, Kubernetes, Go, CI/CD, Terraform

Posted 17 days ago

📍 United States, Canada

🔍 Active Insurance, Digital Risk Management

  • 3+ years of experience in SRE/DevOps/Cloud engineering or Software Development roles.
  • Strong understanding of AWS services and best practices for building scalable and secure infrastructure.
  • Experience with IaC tools like Terraform, CloudFormation, or CDK.
  • Hands-on experience with containerization and orchestration tools such as ECS or Kubernetes.
  • Experience with fault-tolerant services and highly available systems.
  • Understanding of CI/CD pipelines and security auditability.

  • Play a pivotal role in ensuring performance, availability, and efficiency of cloud-based systems.
  • Design, implement, and manage robust cloud solutions.
  • Automate infrastructure and build developer-friendly platforms.
  • Participate in a low-volume on-call rotation to maintain system reliability.
  • Develop strategies for continuous monitoring and for minimizing downtime.

🪄 Skills: AWS, Python, Kafka, Kubernetes, Go, CI/CD, Terraform

Posted 17 days ago

📍 United States

🧭 Full-Time

💸 Salary: 186000 - 251000 USD per year

🔍 Network observability

🏢 Company: Kentik

👥 101-250

💰 $40,000,000 Series C over 3 years ago

Cloud Data Services, Information Technology, Network Security, Software

  • 5+ years of experience in Systems Administration, Datacenter/IT, and/or SRE-related projects.
  • Experience with the *nix command line (e.g., ssh, grep, awk).
  • Detailed understanding of major internet protocols such as TCP/IP, DNS, HTTP, and TLS.
  • Experience with or desire to learn about microservices, containers, and orchestration.
  • Networking administration experience with concepts like routing and firewalls.
  • Passion for documenting code, processes, and infrastructure in runbooks and wikis.
  • Strong collaboration and communication skills for a fully remote environment.
  • Experience with configuration management (e.g., Ansible, Puppet, Chef).
  • Familiarity with metrics monitoring solutions (e.g., Grafana, Prometheus).
  • Automation skills in coding languages like Bash, Python, Ruby, or Go.
  • Experience with public cloud services (AWS, GCP, Azure) and Terraform.

  • Ensure our real-time, scalable, microservices-based infrastructure is set up for growth and working efficiently.
  • Work on tools and processes to better monitor our platform and ensure its stability through rapid growth.
  • Deep dive into diverse topics including NetFlow, IP routing, database replication strategies, or HTTP optimization.
  • Collaborate with engineering and infrastructure teams on operational solutions.
  • Contribute code, engage in code reviews, and write design documents for new features or changes.
  • Provide valuable feedback on team goals, projects, and processes for continuous improvement.

🪄 Skills: Docker, Python, Bash, Cloud Computing, Kafka, *Nix, Go, Grafana, gRPC, Postgres, Prometheus, Redis, Terraform, Microservices

Posted about 1 month ago

📍 Virginia, USA

🧭 Full-Time

💸 Salary: 136500 - 195000 USD per year

🔍 Cybersecurity, Cloud Security

🏢 Company: Zscaler

  • Over 5 years of Site Reliability Engineering experience in both Operations and Engineering environments.
  • Extensive experience with High/Moderate FedRAMP authorization levels and monthly monitoring, including vulnerability scanning, evaluation, patching, and reporting.
  • Proficiency in Linux administration, network troubleshooting, and automation tools like Ansible and Terraform for infrastructure as code.
  • Skilled in Python coding, with knowledge of container-based architectures (AWS ECS, Kubernetes), virtualization, cloud services, web security, and networking protocols (HTTP, SSL/TLS, DNS, SQL).

  • Oversee operational tasks for FedRAMP cloud products, including deployments, on-call duties, and incident management.
  • Participate in regular deployment sync meetings and operational hand-offs.
  • Manage all cloud infrastructure components such as AWS GovCloud, private cloud environments, containers, and VMs.
  • Develop operations documentation, handle escalations, and implement measures to prevent recurring incidents while contributing to DevOps best practices.

🪄 Skills: AWS, Python, Kubernetes, Linux, Terraform, Ansible

Posted about 2 months ago

📍 US and Canada

🧭 Full-Time

💸 150000 - 200000 USD per year

🔍 Healthcare

🏢 Company: Synthesis Health

👥 51-100

💰 Seed about 2 years ago

Medical, Wellness, Health Care

  • Bachelor's degree or diploma in computer science, engineering, mathematics, or a related field.
  • At least one year of experience as a Python developer transitioning to an SRE role.
  • Five years of experience in software development as a DevOps and/or SRE.
  • Two years of experience in an SRE role with Kubernetes, preferably GKE.
  • Experience using ArgoCD for rollouts and deployments.
  • One year of experience with a service mesh such as Istio in a GKE environment.
  • Proficiency in scripting languages like Python and automation tools like Terraform.
  • Solid understanding of security best practices for pipelines and cloud environments.
  • Familiarity with compliance standards like SOC 2, HIPAA.
  • Strong expertise in CI/CD pipeline management.

  • Design and implement automated application deployment processes.
  • Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
  • Manage development, testing, staging, pre-production and production environments.
  • Automate repetitive deployment tasks to improve productivity.
  • Select, develop, and monitor CI/CD systems.
  • Oversee software automation across GCP.
  • Containerize services to optimize resources and deployment speed.
  • Manage and optimize cloud infrastructure for cost and performance.
  • Ensure compliance with security standards and maintain disaster recovery plans.
  • Collaborate with cross-functional teams to improve software delivery.

🪄 Skills: Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer service, DevOps, Terraform, Organizational skills, Documentation, Compliance

Posted about 2 months ago

📍 United States

🔍 Broadcast Automation

🏢 Company: ARFA Solutions, LLC

  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 3+ years of experience in a Site Reliability Engineer role or similar position.
  • Strong knowledge of broadcast automation systems and workflows.
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Proficiency in scripting languages such as Python or Bash.
  • Hands-on experience with cloud platforms (AWS, Azure, or GCP) and their services.
  • Familiarity with containerization technologies such as Docker and orchestration platforms like Kubernetes.
  • Solid understanding of configuration management tools (e.g., Ansible, Terraform).
  • Excellent problem-solving skills and a proactive approach to managing incidents.
  • Strong communication skills and the ability to collaborate with cross-functional teams.

  • Monitor broadcast automation systems to ensure high availability and performance.
  • Implement and manage automated deployment processes for broadcasting applications.
  • Troubleshoot and resolve incidents impacting broadcast automation services promptly and effectively.
  • Work closely with the development team to design and implement scalable solutions that meet reliability standards.
  • Participate in on-call rotations to provide support for critical incidents.
  • Develop and maintain documentation for system architecture, operational procedures, and incident reports.
  • Continuously assess and improve existing processes for reliability and efficiency.
  • Stay up-to-date with industry trends and technologies related to site reliability and broadcast automation.

🪄 Skills: AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Communication Skills, Terraform, Documentation

Posted about 2 months ago