Senior Technical Product Manager, GPU Orchestration
Vultr
Cloud Infrastructure
Remote - United States
Full-Time
Senior
Salary: 130,000 - 165,000 USD per year
Job Details
- Experience: 7+ years of product management experience
- Required Skills: Kubernetes, Product Management, Distributed Systems
Requirements
- 7+ years of product management experience in cloud infrastructure, container orchestration, HPC, or developer platforms
- Deep understanding of Kubernetes, Slurm, or similar orchestration and scheduling systems, including GPU scheduling, resource management, and multi-tenant isolation
- Experience defining product strategy and roadmaps for platform or infrastructure products at scale
- Strong technical background, with the ability to engage with engineering on cluster lifecycle, control plane reliability, API design, and distributed systems
- Experience with AI/ML infrastructure, including training workloads, inference serving, and GPU resource optimization
- Track record of shipping developer- and operator-facing products with measurable impact on reliability, adoption, or operational efficiency
- Experience working across cross-functional teams (engineering, design, marketing, sales) in a fast-paced environment
- Excellent written and verbal communication skills
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
Responsibilities
- Define and execute the roadmap for managed Kubernetes, managed Slurm services, SUNK, and Run:ai integration
- Own the end-to-end cluster lifecycle, including provisioning, configuration, upgrades, scaling, high availability, and decommissioning
- Establish scheduling and resource management capabilities for GPU workloads, including quotas, fair-share policies, multi-tenant isolation, and priority handling
- Drive integration between orchestration services and core infrastructure components, including networking, storage, identity, observability, and billing systems
- Define service-level objectives for control plane reliability, job scheduling latency, cluster availability, and upgrade stability
- Design APIs, CLI tooling, and UI workflows that enable self-service cluster management and workload operations
- Partner with customer-facing teams to understand training, inference, and HPC use cases, translating real workload requirements into product capabilities
- Monitor industry trends in container orchestration, HPC scheduling, distributed systems, and AI infrastructure to inform product direction