Senior Technical Product Manager, GPU Orchestration

Vultr
Cloud Infrastructure
Remote - United States | Full-Time | Senior
Salary: 130,000 - 165,000 USD per year

Job Details

Experience
7+ years of product management experience
Required Skills
Kubernetes, Product Management, Distributed Systems

Requirements

  • 7+ years of product management experience in cloud infrastructure, container orchestration, HPC, or developer platforms
  • Deep understanding of Kubernetes, Slurm, or similar orchestration and scheduling systems, including GPU scheduling, resource management, and multi-tenant isolation
  • Experience defining product strategy and roadmaps for platform or infrastructure products at scale
  • Strong technical background, with the ability to engage with engineering on cluster lifecycle, control plane reliability, API design, and distributed systems
  • Experience with AI/ML infrastructure, including training workloads, inference serving, and GPU resource optimization
  • Track record of shipping developer- and operator-facing products with measurable impact on reliability, adoption, or operational efficiency
  • Experience working across cross-functional teams (engineering, design, marketing, sales) in a fast-paced environment
  • Excellent written and verbal communication skills
  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)

Responsibilities

  • Define and execute the roadmap for managed Kubernetes, managed Slurm services, SUNK, and Run:ai integration
  • Own the end-to-end cluster lifecycle, including provisioning, configuration, upgrades, scaling, high availability, and decommissioning
  • Establish scheduling and resource management capabilities for GPU workloads, including quotas, fair-share policies, multi-tenant isolation, and priority handling
  • Drive integration between orchestration services and core infrastructure components, including networking, storage, identity, observability, and billing systems
  • Define service-level objectives for control plane reliability, job scheduling latency, cluster availability, and upgrade stability
  • Design APIs, CLI tooling, and UI workflows that enable self-service cluster management and workload operations
  • Partner with customer-facing teams to understand training, inference, and HPC use cases, translating real workload requirements into product capabilities
  • Monitor industry trends in container orchestration, HPC scheduling, distributed systems, and AI infrastructure to inform product direction