Principal Observability & Reliability Architect

United States, Full-Time, Principal
Salary: 180,000 - 240,000 USD per year

Job Details

Experience
10+ years in observability, monitoring, APM, platform operations, SRE, or related enterprise technology domains, including 5+ years leading architecture and delivery strategy.
Required Skills
Cloud Computing, Kubernetes, CI/CD

Requirements

  • 10+ years in observability, monitoring, APM, platform operations, SRE, or related enterprise technology domains.
  • 5+ years leading architecture and delivery strategy for enterprise observability or reliability initiatives.
  • Hands-on experience designing and implementing solutions across monitoring, logging, metrics, tracing, telemetry collection, and pipeline patterns in hybrid and multi-cloud environments.
  • Expertise in telemetry governance: routing, transformation, normalization, enrichment, retention, access control, and cost management.
  • Experience defining enterprise standards for dashboards, alerts, tagging, naming, service ownership, RBAC, and operating model adoption.
  • Strong command of incident response, event correlation, alert strategy, service health, and business-service visibility.
  • Applied knowledge of SRE concepts including SLIs, SLOs, error budgets, and production readiness.
  • Ability to lead executive and technical workshops and translate business needs into actionable architecture and delivery plans.
  • Consulting or professional services experience, with strong client-facing communication, estimation, risk management, and cross-functional leadership skills.

Responsibilities

  • Lead client discovery, architecture workshops, and solution design across observability, telemetry, reliability, and operational intelligence initiatives.
  • Design enterprise observability architectures spanning monitoring, logging, metrics, tracing, telemetry pipelines, alerting, event correlation, service visibility, and platform integrations.
  • Define scalable standards for telemetry onboarding, naming, tagging, RBAC, service ownership, dashboards, alert governance, runbooks, and operational handoff.
  • Advise on telemetry governance, including data quality, retention, access control, sampling, cardinality, and cost optimization.
  • Lead modernization initiatives including tool rationalization, dashboard and alert rationalization, telemetry strategy, and migration from legacy monitoring platforms.
  • Guide SRE practices including SLIs, SLOs, error budgets, production readiness, and incident response maturity.
  • Design integration patterns across ITSM, CMDB, event management, and automation platforms.
  • Support pursuits by shaping solution strategy, validating scope, informing estimates, and building client-facing technical narratives.
  • Serve as a senior escalation point and provide architecture governance during delivery.
  • Build reusable reference architectures, playbooks, and accelerators while mentoring architects, consultants, and offshore teams.