Principal Site Reliability Engineer

Nscale, AI Infrastructure

AMER, Full-Time, Principal

Salary: 150,000 - 2,150,000 USD per year

Job Details

10+ years of experience in SRE, Systems, or Software Engineering in large-scale infrastructure
Expert-level software engineering skills
Deep expertise in Linux, networking, and distributed systems design
Extensive experience debugging failures across hardware, OS, networking, and application layers
Proven ability to lead technical initiatives across teams without direct authority
Strong systems-thinking mindset

Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure
Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
Defining reliability standards, SLO frameworks, and operational best practices
Acting as a senior technical escalation point during critical incidents
Identifying structural reliability risks and driving cross-functional initiatives
Partnering with Engineering, Network Operations, and Fleet Operations leadership
Mentoring senior and mid-level engineers
Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability
