Ensure infrastructure and system software are production-ready for new hardware and compute platforms.
Drive end-to-end programs spanning GPU provisioning, at-scale deployments, Fleet NPI readiness, and vendor management.
Coordinate with hardware compute engineering, Fleet teams, and external vendors to maintain service reliability, enforce SLAs, and lead incident response efforts.
Partner with engineering teams to improve monitoring, telemetry, and fleet observability for proactive performance management.
Define and track metrics around GPU fleet health, performance, and reliability.
Run post-incident reviews and drive action items that enhance system reliability and prevent regressions.
Collaborate with internal customers to collect feedback, enable adoption of core infrastructure platforms, and refine onboarding experiences.
Work closely with Product, Infrastructure, Platform Engineering, Vendor, and Customer Experiences to align on roadmap priorities and customer delivery timelines.
Communicate program status, risks, and critical decisions to senior leadership and executive stakeholders.