Senior Technical Product Manager, Observability

Vultr
Cloud Infrastructure
Remote - United States
Full-Time
Senior
Salary
130,000 - 165,000 USD per year

Job Details

Experience
7+ years
Required Skills
Kubernetes, Product Management, Distributed Systems

Requirements

  • 7+ years of product management experience in cloud infrastructure, observability, monitoring, or developer platforms
  • Deep understanding of observability and monitoring systems, including metrics, logging, tracing, alerting, and telemetry pipeline architecture
  • Experience defining product strategy and roadmaps for platform or infrastructure products at scale
  • Strong technical background, with the ability to engage with engineering on telemetry agents, data models, query engines, retention, and distributed systems
  • Experience with GPU, AI/ML, or HPC infrastructure monitoring and the unique observability challenges of training and inference workloads
  • Track record of shipping developer- and operator-facing products with measurable impact on reliability, time-to-detect, or operational efficiency
  • Experience working across cross-functional teams (engineering, design, marketing, sales) in a fast-paced environment
  • Excellent written and verbal communication skills, with the ability to translate complex technical concepts for diverse audiences
  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)

Responsibilities

  • Own the end-to-end Observability Platform roadmap across telemetry ingestion, querying, visualization, alerting, and retention for large-scale GPU clusters and multi-tenant cloud environments
  • Define Vultr's observability strategy across bare metal, VMs, Kubernetes, and managed services, aligned to infrastructure roadmap, reliability goals, and customer experience
  • Drive the customer-facing observability surface across dashboards, APIs, telemetry pipelines, and topology-aware insights
  • Translate low-level signals across GPU, CPU, memory, storage, and network into actionable health views, alerts, and debugging workflows for customers
  • Work closely with engineering on technical tradeoffs across metrics agents, collectors, data models, telemetry pipelines, APIs, and retention architecture
  • Build products for distributed AI environments by understanding how training and inference workloads behave across nodes, clusters, schedulers, and network fabrics
  • Define health models that help customers quickly identify degraded nodes, performance anomalies, and cluster bottlenecks at fleet scale
  • Ensure new infrastructure and platform launches are observable by design through strong partnership with compute, network, and platform teams
  • Stay current on modern observability stacks and AI infrastructure trends, including how GPU workloads change performance analysis, cost attribution, and operational workflows