Site Reliability Engineer - Manager

Posted 2024-10-19

💎 Seniority level: Manager, 5+ years

📍 Location: USA

💸 Salary: 180,000 - 200,000 USD per year

🔍 Industry: AI and machine learning

🏢 Company: RunPod, Inc.

🗣️ Languages: English

⏳ Experience: 5+ years

🪄 Skills: AWS, Leadership, Python, GCP, Machine Learning, Azure, Golang, Communication Skills, Problem Solving

Requirements:
  • 5+ years of experience in Site Reliability Engineering or a similar role.
  • 3+ years of experience in a technical leadership or management position.
  • Deep understanding of Linux systems, containerization, virtualization, and networking technologies.
  • Strong background in managing large-scale distributed systems and bare-metal fleets.
  • Expertise in infrastructure-as-code and configuration management tools.
  • Proficiency in at least one programming language, preferably Python or Golang.
  • Experience with cloud platforms (AWS, GCP, Azure) and services.
  • Strong knowledge of monitoring, observability, and alerting systems.
  • Excellent problem-solving skills for managing complex incidents.
  • Proven track record with SLIs, SLOs, and SLAs.
  • Strong communication skills for conveying technical concepts to various stakeholders.
Responsibilities:
  • Lead and mentor a team of Site Reliability Engineers, fostering a culture of innovation and technical excellence.
  • Develop and implement strategic plans to enhance the reliability and scalability of our infrastructure.
  • Collaborate with cross-functional teams to align SRE initiatives with organizational goals.
  • Establish and maintain SLIs, SLOs, and SLAs for critical systems.
  • Drive the adoption of best practices in automation and incident response.
  • Oversee management of large-scale bare-metal fleets across multiple data centers.
  • Ensure robust security practices are implemented throughout the infrastructure.
  • Manage on-call rotations and handle escalations during critical incidents.
  • Contribute to capacity planning and resource allocation for growth.
  • Develop and track KPIs for the SRE team and infrastructure health.