Senior Technical Product Manager, Observability

Vultr | Cloud Infrastructure

Remote - United States | Full-Time | Senior

Salary: 130,000 - 165,000 USD per year

Job Details

7+ years of product management experience in cloud infrastructure, observability, monitoring, or developer platforms
Deep understanding of observability and monitoring systems, including metrics, logging, tracing, alerting, and telemetry pipeline architecture
Experience defining product strategy and roadmaps for platform or infrastructure products at scale
Strong technical background, with the ability to engage with engineering on telemetry agents, data models, query engines, retention, and distributed systems
Experience with GPU, AI/ML, or HPC infrastructure monitoring and the unique observability challenges of training and inference workloads
Track record of shipping developer- and operator-facing products with measurable impact on reliability, time-to-detect, or operational efficiency
Experience working across cross-functional teams (engineering, design, marketing, sales) in a fast-paced environment
Excellent written and verbal communication skills, with the ability to translate complex technical concepts for diverse audiences
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)

Own the end-to-end Observability Platform roadmap across telemetry ingestion, querying, visualization, alerting, and retention for large-scale GPU clusters and multi-tenant cloud environments
Define Vultr's observability strategy across bare metal, VMs, Kubernetes, and managed services, aligned to infrastructure roadmap, reliability goals, and customer experience
Drive the customer-facing observability surface across dashboards, APIs, telemetry pipelines, and topology-aware insights
Translate low-level signals across GPU, CPU, memory, storage, and network into actionable health views, alerts, and debugging workflows for customers
Work closely with engineering on technical tradeoffs across metrics agents, collectors, data models, telemetry pipelines, APIs, and retention architecture
Build products for distributed AI environments by understanding how training and inference workloads behave across nodes, clusters, schedulers, and network fabrics
Define health models that help customers quickly identify degraded nodes, performance anomalies, and cluster bottlenecks at fleet scale
Ensure new infrastructure and platform launches are observable by design through strong partnership with compute, network, and platform teams
Stay current on modern observability stacks and AI infrastructure trends, including how GPU workloads change performance analysis, cost attribution, and operational workflows
