Principal Site Reliability Engineer

Nscale
AI Infrastructure
AMER, Full-Time, Principal
Salary: 150,000 - 2,150,000 USD per year

Job Details

Experience
10+ years
Required Skills
Kubernetes, Linux, Networking, Distributed Systems

Requirements

  • 10+ years of experience in SRE, Systems, or Software Engineering in large-scale infrastructure
  • Expert-level software engineering skills
  • Deep expertise in Linux, networking, and distributed systems design
  • Extensive experience debugging failures across hardware, OS, networking, and application layers
  • Proven ability to lead technical initiatives across teams without direct authority
  • Strong systems-thinking mindset

Responsibilities

  • Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure
  • Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
  • Defining reliability standards, SLO frameworks, and operational best practices
  • Acting as a senior technical escalation point during critical incidents
  • Identifying structural reliability risks and driving cross-functional initiatives
  • Partnering with Engineering, Network Operations, and Fleet Operations leadership
  • Mentoring senior and mid-level engineers
  • Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability