Principal Technical Program Manager - AI Infrastructure Operations
Nscale
AI Infrastructure
Location: US
Employment type: Full-Time
Level: Principal
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: Agile, Scrum, CI/CD, Linux, Networking, Distributed Systems
Requirements
- 5+ years of experience in a Technical Program Management role.
- Strong foundational understanding of data center infrastructure, distributed systems, Linux, and networking.
- Proven expertise in modern program management methodologies (Agile, Scrum).
- Experience defining, tracking, and improving system performance based on operational metrics (MTTR, SLOs/SLIs).
- Ability to thrive in a fast-paced, high-growth environment.
- Direct experience managing data center infrastructure build-outs and hardware commissioning (preferred).
- Domain knowledge of AI/HPC infrastructure including NVIDIA GPUs and InfiniBand/RDMA (preferred).
- Experience in a hyperscale or public cloud environment (preferred).
- Familiarity with SRE principles, automation, and CI/CD pipelines (preferred).
- Bachelor's or Master's degree in a technical field or equivalent practical experience.
Responsibilities
- Own planning and execution of strategic operational programs including data center build-outs and fleet rollouts.
- Establish infrastructure KPIs, such as Availability (97.5%) and Uptime (99%), and drive accountability for them.
- Develop dashboards to provide visibility into operational health and program status.
- Optimize operational workflows across Fleet, Network, and SRE teams.
- Standardize incident management, change management, and postmortem processes.
- Coordinate between engineering teams (Hardware, Compute, Network) and external vendors.
- Partner with Data Science to translate capacity models into infrastructure roadmaps.
- Proactively identify and mitigate technical and resource risks.