Principal Site Reliability Engineer
Nscale
AI Infrastructure
AMER, Full-Time, Principal
Salary: 150,000 - 2,150,000 USD per year
Job Details
- Experience: 10+ years
- Required Skills: Kubernetes, Linux, Networking, Distributed Systems
Requirements
- 10+ years of experience in SRE, Systems, or Software Engineering in large-scale infrastructure
- Expert-level software engineering skills
- Deep expertise in Linux, networking, and distributed systems design
- Extensive experience debugging failures across hardware, OS, networking, and application layers
- Proven ability to lead technical initiatives across teams without direct authority
- Strong systems-thinking mindset
Responsibilities
- Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure
- Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
- Defining reliability standards, SLO frameworks, and operational best practices
- Acting as a senior technical escalation point during critical incidents
- Identifying structural reliability risks and driving cross-functional initiatives
- Partnering with Engineering, Network Operations, and Fleet Operations leadership
- Mentoring senior and mid-level engineers
- Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability