Principal Technical Program Manager - AI Infrastructure Operations

Nscale | AI Infrastructure
Location: US | Full-Time | Principal
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
Agile, Scrum, CI/CD, Linux, Networking, Distributed Systems

Requirements

  • 5+ years of experience in a Technical Program Management role.
  • Strong foundational understanding of data center infrastructure, distributed systems, Linux, and networking.
  • Proven expertise in modern program management methodologies (Agile, Scrum).
  • Experience defining, tracking, and improving system performance based on operational metrics (MTTR, SLOs/SLIs).
  • Ability to thrive in a fast-paced, high-growth environment.
  • Direct experience managing data center infrastructure build-outs and hardware commissioning (preferred).
  • Domain knowledge of AI/HPC infrastructure including NVIDIA GPUs and InfiniBand/RDMA (preferred).
  • Experience in a hyperscale or public cloud environment (preferred).
  • Familiarity with SRE principles, automation, and CI/CD pipelines (preferred).
  • Bachelor's or Master's degree in a technical field or equivalent practical experience.

Responsibilities

  • Own planning and execution of strategic operational programs including data center build-outs and fleet rollouts.
  • Establish and drive accountability for infrastructure KPIs such as availability (97.5%) and uptime (99%).
  • Develop dashboards to provide visibility into operational health and program status.
  • Optimize operational workflows across Fleet, Network, and SRE teams.
  • Standardize incident management, change management, and postmortem processes.
  • Coordinate between engineering teams (Hardware, Compute, Network) and external vendors.
  • Partner with Data Science to translate capacity models into infrastructure roadmaps.
  • Proactively identify and mitigate technical and resource risks.