Site Reliability Engineer

Posted 2 months ago

💎 Seniority level: Senior

📍 Location: USA

💸 Salary: 152000 - 175000 USD per year

🔍 Industry: AI and machine learning

🏢 Company: RunPod, Inc.

🗣️ Languages: English

🪄 Skills: Docker, Python, Go, Grafana, Communication Skills

Requirements:

Deep knowledge of Linux kernel internals, containerization (Docker), virtualization (Kata/QEMU), and networking components.
Extensive experience with distributed system troubleshooting and design.
Proficiency in at least one programming language, preferably Python or Golang.
Proven experience implementing and managing SLIs and SLOs.
Experience with pull-based configuration management tools such as Chef or Puppet.
Demonstrated ability to manage large-scale bare-metal fleets (5,000+ machines) across multiple data centers.
Strong background in implementing security best practices for foundational systems, including secret management, AWS IAM permissions, and key distribution systems.
Comprehensive understanding of OSI model Layers 3, 4, and 7.
Successful completion of a background check.

Responsibilities:

Design, implement, and maintain robust, scalable, and highly available systems.
Troubleshoot and resolve complex issues in distributed environments.
Develop and implement SLIs and SLOs to ensure system reliability and performance.
Manage and optimize large-scale bare-metal fleets across multiple data centers.
Implement and maintain secure practices for foundational systems.
Collaborate with cross-functional teams to improve system design and operation.
Automate processes to increase efficiency and reduce human error.
Participate in on-call rotations to provide 24/7 support for critical systems.

Related Jobs

🔥 Site Reliability Engineer

Posted 3 days ago

📍 United States

🧭 Full-Time

🔍 Cybersecurity

🏢 Company: Keeper Security, Inc.

🔧 Requirements

5+ years of experience as a Site Reliability Engineer, DevOps Engineer, or similar role focused on infrastructure and automation.
Proficiency with cloud platforms (AWS, Azure, Google Cloud) and infrastructure management tools like Terraform, Kubernetes, or Docker.
Experience with CI/CD tools (Jenkins, GitHub Actions, or GitLab CI).
Strong understanding of monitoring tools (Prometheus, Grafana, New Relic) for system reliability.
Experience with Linux, Mac OS X, and Windows, plus knowledge of scripting languages (Python, Bash, Go).
In-depth knowledge of networking concepts, security best practices, and incident management.
Ability to communicate complex technical issues clearly within a cross-functional team.
Proactive problem-solver with a collaborative mindset.

💡 Responsibilities

Design, implement, and manage the infrastructure and tools for CI/CD and software deployment.
Ensure high availability and performance of production systems by monitoring critical services.
Manage infrastructure automation using tools like Terraform or Kubernetes.
Support security audits to meet regulatory standards.
Collaborate with teams to optimize build and release pipelines.
Troubleshoot issues related to system performance and reliability.
Stay updated with trends in infrastructure and DevOps practices.
Contribute to monitoring and alerting systems to identify reliability issues.
Promote a reliability culture by defining and achieving reliability goals.

🪄 Skills: AWS, Docker, Python, Bash, Jenkins, Kubernetes, Mac OS X, Go, Grafana, Prometheus, Linux, Terraform

🔥 Site Reliability Engineer | North America | Canada | Europe | Fully Remote

Posted 9 days ago

📍 United States, United Kingdom, Spain, Italy, Canada

🔍 Interactive entertainment

🏢 Company: Escape Velocity Entertainment Inc

🔧 Requirements

5+ years of experience in a Site Reliability, DevOps, or Platform engineering role.
5+ years of experience with observability, application monitoring, telemetry collection, and data visualization tools.
Experience with GitOps workflows and Helix Core / Perforce.
Experience in implementing and maintaining CI/CD systems.
Expertise in Infrastructure as Code design using Ansible, Terraform, and CloudFormation.
Experience with backend game engines like Pragma, GameLift, or Agones.
Experience with capacity planning and FinOps.
Proficiency in one or more high-level languages such as Python, Kotlin, JavaScript, or C++.
Strong Linux skills and understanding of public cloud services.

💡 Responsibilities

Analyze, implement, and improve complex systems responsible for delivering games to millions of fans.
Own the delivery, scalability, and reliability of the cloud-hosted game title.
Partner with game teams to advise and implement best practices.
Take ownership of projects from start to finish while maintaining quality.
Engage with product teams to diagnose and resolve operational issues.
Maintain relationships with internal and external partners.
Optimize reliability, availability, observability, and cost.
Participate in on-call rotations for critical incidents.

🪄 Skills: Python, JavaScript, Kotlin, C++, CI/CD, Linux, Terraform, Data visualization, Ansible

🔥 Site Reliability Engineer

Posted 14 days ago

📍 United States

🧭 Full-Time

💸 125000 - 135000 USD per year

🔍 Transaction and compliance software for state and local governments

🏢 Company: GovOS

🔧 Requirements

At least 2 years of experience managing, troubleshooting, and optimizing Linux and Windows environments.
Demonstrated ability in various programming and scripting languages (e.g., Python, Bash, PowerShell).
Hands-on experience in designing, deploying, and maintaining cloud infrastructure in AWS or Azure.
In-depth knowledge of container technologies (Docker, Podman) and orchestration platforms (Kubernetes, ECS, AKS).
Strong expertise in version control systems (e.g., Git) and configuration management tools (e.g., Ansible, Terraform, Chef).
Experience in administering and optimizing databases (e.g., MySQL, PostgreSQL, MongoDB).

💡 Responsibilities

Enhance Developer Workflows: Design and implement productive developer workflows, focusing on automation.
Continuous Integration & Deployment: Build, optimize, and maintain CI/CD pipelines.
Environment Design: Collaborate on the design of production and non-production environments for scalability and reliability.
Data Security: Develop processes to protect customer data, aligning with best practices.
Incident Management: Participate in on-call rotations for system reliability.
Team Contribution: Support the team with other duties and initiatives.

🪄 Skills: AWS, Docker, PostgreSQL, Python, Bash, Git, Jenkins, Kubernetes, MongoDB, MySQL, Azure, Linux, Terraform, Ansible

🔥 Site Reliability Engineer, Customer Security

Posted 17 days ago

📍 United States, Canada

🧭 Full-Time

💸 108000 - 163900 USD per year

🔍 Active Insurance, Digital Risk Management

🏢 Company: Coalition, Inc.

🔧 Requirements

3+ years of experience in SRE/DevOps/Cloud engineering or Software Development roles in a full stack engineering environment.
Strong understanding of AWS services (e.g., EC2, S3, RDS, Lambda, VPC).
Hands-on experience with IaC tools like Terraform, CloudFormation, or CDK.
Experience with containerization and orchestration tools such as ECS and Kubernetes.
Experience working with fault-tolerant services and developing highly available systems.
Exposure to full-stack monitoring and CI/CD pipelines.
Some knowledge of software engineering design patterns, agile development, and architecture principles.
Strong analytical and problem-solving skills.

💡 Responsibilities

Play a pivotal role in ensuring the performance, availability, and efficiency of cloud-based systems.
Design, implement, and manage robust cloud solutions.
Automate infrastructure and build developer-friendly platforms.
Optimize cloud resources and improve system observability.
Drive operational excellence across the organization.
Participate in a low-volume on-call rotation to maintain system reliability.

🪄 Skills: AWS, Docker, Python, Kubernetes, Go, CI/CD, Terraform

🔥 Site Reliability Engineer, Customer Security

Posted 17 days ago

📍 United States, Canada

💸 108000 - 163900 USD per year

🔍 Insurance, Cybersecurity

🔧 Requirements

3+ years of experience in SRE/DevOps/Cloud engineering or Software Development roles.
Strong understanding of AWS services and best practices.
Hands-on experience with IaC tools like Terraform or CloudFormation.
Experience with containerization tools such as ECS or Kubernetes.
Exposure to full-stack monitoring and CI/CD pipelines.
Strong analytical and problem-solving skills.

💡 Responsibilities

Design, implement, and manage robust cloud solutions.
Work closely with cross-functional teams.
Isolate, trap, and respond to system failures.
Develop strategies for continuous monitoring and analysis.
Participate in a low-volume on-call rotation to maintain reliability.

🪄 Skills: AWS, Docker, Python, Java, Kafka, Kubernetes, Go, CI/CD, Terraform

🔥 Site Reliability Engineer, Customer Security

Posted 17 days ago

📍 United States, Canada

🔍 Active Insurance, Digital Risk Management

🔧 Requirements

3+ years of experience in SRE/DevOps/Cloud engineering or Software Development roles.
Strong understanding of AWS services and best practices for building scalable and secure infrastructure.
Experience with IaC tools like Terraform, CloudFormation, or CDK.
Hands-on experience with containerization and orchestration tools such as ECS or Kubernetes.
Experience with fault-tolerant services and highly available systems.
Understanding of CI/CD pipelines and security auditability.

💡 Responsibilities

Play a pivotal role in ensuring performance, availability, and efficiency of cloud-based systems.
Design, implement, and manage robust cloud solutions.
Automate infrastructure and build developer-friendly platforms.
Participate in a low-volume on-call rotation to maintain system reliability.
Develop strategies for continuous monitoring and for minimizing downtime.

🪄 Skills: AWS, Python, Kafka, Kubernetes, Go, CI/CD, Terraform

🔥 Sr Site Reliability Engineer, Platform Engineering

Posted about 1 month ago

📍 United States

🧭 Full-Time

💸 186000 - 251000 USD per year

🔍 Network observability

🏢 Company: Kentik · 👥 101-250 · 💰 $40,000,000 Series C over 3 years ago · Cloud Data Services, Information Technology, Network Security, Software

🔧 Requirements

5+ years of experience in Systems Administration, Datacenter/IT, and/or SRE-related projects.
Experience with *nix system command line (e.g., ssh, grep, awk).
Detailed understanding of major internet protocols such as TCP/IP, DNS, HTTP, and TLS.
Experience with or desire to learn about microservices, containers, and orchestration.
Networking administration experience with concepts like routing and firewalls.
Passion for documenting code, processes, and infrastructure in runbooks and wikis.
Strong collaboration and communication skills for a fully remote environment.
Experience with configuration management (e.g., Ansible, Puppet, Chef).
Familiarity with metrics monitoring solutions (e.g., Grafana, Prometheus).
Automation skills in coding languages like Bash, Python, Ruby, or Go.
Experience with public cloud services (AWS, GCP, Azure) and Terraform.

💡 Responsibilities

Ensure our real-time, scalable, microservices-based infrastructure is set up for growth and working efficiently.
Work on tools and processes to better monitor our platform and ensure its stability through rapid growth.
Deep dive into diverse topics including NetFlow, IP routing, database replication strategies, or HTTP optimization.
Collaborate with engineering and infrastructure teams on operational solutions.
Contribute code, engage in code reviews, and write design documents for new features or changes.
Provide valuable feedback on team goals, projects, and processes for continuous improvement.

🪄 Skills: Docker, Python, Bash, Cloud Computing, Kafka, *Nix, Go, Grafana, gRPC, Postgres, Prometheus, Redis, Terraform, Microservices

🔥 Sr. Staff Site Reliability Engineer - Federal, Security Clearance Required

Posted about 2 months ago

📍 Virginia, USA

🧭 Full-Time

💸 136500 - 195000 USD per year

🔍 Cybersecurity, Cloud Security

🏢 Company: Zscaler

🔧 Requirements

Over 5 years of Site Reliability Engineering experience in both Operations and Engineering environments.
Extensive experience with High/Moderate FedRAMP authorization levels and monthly monitoring, including vulnerability scanning, evaluation, patching, and reporting.
Proficiency in Linux administration, network troubleshooting, and automation tools like Ansible and Terraform for infrastructure as code.
Skilled in Python coding, with knowledge of container-based architectures (AWS ECS, Kubernetes), virtualization, cloud services, web security, and networking protocols (HTTP, SSL/TLS, DNS, SQL).

💡 Responsibilities

Oversee operational tasks for FedRAMP cloud products, including deployments, on-call duties, and incident management.
Participate in regular deployment sync meetings and operational hand-offs.
Manage all cloud infrastructure components such as AWS GovCloud, private cloud environments, containers, and VMs.
Develop operations documentation, handle escalations, and implement measures to prevent recurring incidents while contributing to DevOps best practices.

🪄 Skills: AWS, Python, Kubernetes, Linux, Terraform, Ansible

🔥 Senior Site Reliability Engineer

Posted about 2 months ago

📍 US and Canada

🧭 Full-Time

💸 150000 - 200000 USD per year

🔍 Healthcare

🏢 Company: Synthesis Health · 👥 51-100 · 💰 Seed about 2 years ago · Medical, Wellness, Health Care

🔧 Requirements

Bachelor's degree or Diploma in computer science, engineering, mathematics, or related field.
At least one year of experience as a Python developer transitioning to an SRE role.
Five years of experience in software development in a DevOps and/or SRE role.
Two years of experience in an SRE role with Kubernetes, preferably GKE.
Experience using ArgoCD for rollouts and deployments.
One year of experience with a service mesh such as Istio in a GKE environment.
Proficiency in scripting languages like Python and automation tools like Terraform.
Solid understanding of security best practices for pipelines and cloud environments.
Familiarity with compliance standards like SOC 2, HIPAA.
Strong expertise in CI/CD pipeline management.

💡 Responsibilities

Design and implement automated application deployment processes.
Establish and measure Service Level Objectives (SLO) and Budgets (SLB).
Manage development, testing, staging, pre-production and production environments.
Automate repetitive deployment tasks to improve productivity.
Select, develop, and monitor CI/CD systems.
Oversee software automation across GCP.
Containerize services to optimize resources and deployment speed.
Manage and optimize cloud infrastructure for cost and performance.
Ensure compliance with security standards and maintain disaster recovery plans.
Collaborate with cross-functional teams to improve software delivery.

🪄 Skills: Leadership, Python, Software Development, GCP, Git, Kubernetes, Software Architecture, Analytical Skills, Collaboration, CI/CD, Customer service, DevOps, Terraform, Organizational skills, Documentation, Compliance

🔥 Site Reliability Engineer (Broadcast Automation)

Posted about 2 months ago

📍 United States

🔍 Broadcast Automation

🏢 Company: ARFA Solutions, LLC

🔧 Requirements

Bachelor's degree in Computer Science, Engineering, or a related field.
3+ years of experience in a Site Reliability Engineer role or similar position.
Strong knowledge of broadcast automation systems and workflows.
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Proficiency in scripting languages such as Python or Bash.
Hands-on experience with cloud platforms (AWS, Azure, or GCP) and their services.
Familiarity with containerization technologies such as Docker and orchestration platforms like Kubernetes.
Solid understanding of configuration management tools (e.g., Ansible, Terraform).
Excellent problem-solving skills and a proactive approach to managing incidents.
Strong communication skills and the ability to collaborate with cross-functional teams.

💡 Responsibilities

Monitor broadcast automation systems to ensure high availability and performance.
Implement and manage automated deployment processes for broadcasting applications.
Troubleshoot and resolve incidents impacting broadcast automation services promptly and effectively.
Work closely with the development team to design and implement scalable solutions that meet reliability standards.
Participate in on-call rotations to provide support for critical incidents.
Develop and maintain documentation for system architecture, operational procedures, and incident reports.
Continuously assess and improve existing processes for reliability and efficiency.
Stay up-to-date with industry trends and technologies related to site reliability and broadcast automation.

🪄 Skills: AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Grafana, Prometheus, Communication Skills, Terraform, Documentation
