Senior Technical Product Manager, Observability
Vultr - Cloud Infrastructure
Remote - United States, Full-Time, Senior
Salary: 130,000 - 165,000 USD per year
Job Details
- Experience: 7+ years
- Required Skills: Kubernetes, Product Management, Distributed Systems
Requirements
- 7+ years of product management experience in cloud infrastructure, observability, monitoring, or developer platforms
- Deep understanding of observability and monitoring systems, including metrics, logging, tracing, alerting, and telemetry pipeline architecture
- Experience defining product strategy and roadmaps for platform or infrastructure products at scale
- Strong technical background, with the ability to engage with engineering on telemetry agents, data models, query engines, retention, and distributed systems
- Experience with GPU, AI/ML, or HPC infrastructure monitoring and the unique observability challenges of training and inference workloads
- Track record of shipping developer- and operator-facing products with measurable impact on reliability, time-to-detect, or operational efficiency
- Experience working across cross-functional teams (engineering, design, marketing, sales) in a fast-paced environment
- Excellent written and verbal communication skills, with the ability to translate complex technical concepts for diverse audiences
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
Responsibilities
- Own the end-to-end Observability Platform roadmap across telemetry ingestion, querying, visualization, alerting, and retention for large-scale GPU clusters and multi-tenant cloud environments
- Define Vultr's observability strategy across bare metal, VMs, Kubernetes, and managed services, aligned to the infrastructure roadmap, reliability goals, and customer experience
- Drive the customer-facing observability surface across dashboards, APIs, telemetry pipelines, and topology-aware insights
- Translate low-level signals across GPU, CPU, memory, storage, and network into actionable health views, alerts, and debugging workflows for customers
- Work closely with engineering on technical tradeoffs across metrics agents, collectors, data models, telemetry pipelines, APIs, and retention architecture
- Build products for distributed AI environments by understanding how training and inference workloads behave across nodes, clusters, schedulers, and network fabrics
- Define health models that help customers quickly identify degraded nodes, performance anomalies, and cluster bottlenecks at fleet scale
- Ensure new infrastructure and platform launches are observable by design through strong partnership with compute, network, and platform teams
- Stay current on modern observability stacks and AI infrastructure trends, including how GPU workloads change performance analysis, cost attribution, and operational workflows