Principal Observability & Reliability Architect

Based in United States | Full-Time | Principal

Salary: Competitive On-Target Earnings (OTE) package including base salary and performance-based incentives, determined by experience and location.

Job Details

Experience: 10+ years of experience in observability, platform operations, SRE, monitoring, APM, or related enterprise infrastructure domains, including 5+ years in architecture or technical leadership roles.
Required Skills: ServiceNow, Datadog

Requirements

10+ years of experience in observability, platform operations, SRE, monitoring, APM, or related enterprise infrastructure domains, including 5+ years in architecture or technical leadership roles.
Strong hands-on expertise designing and implementing observability solutions across metrics, logs, traces, telemetry pipelines, and distributed systems in cloud and hybrid environments.
Deep understanding of telemetry governance frameworks, including data normalization, enrichment, routing, retention strategies, access control, and cost optimization.
Proven ability to define enterprise standards for dashboards, alerts, service tagging, naming conventions, RBAC, and operational maturity models.
Strong SRE background with practical experience implementing SLIs, SLOs, error budgets, incident response processes, and production reliability practices.
Experience integrating observability platforms with ITSM and operational tools such as ServiceNow, PagerDuty, Jira Service Management, or similar ecosystems.
Consulting or professional services experience with strong client-facing communication, workshop facilitation, estimation, and cross-functional leadership skills.
Ability to translate complex technical challenges into clear, actionable architecture and delivery plans for both technical and executive audiences.
Experience with platforms such as Datadog, Dynatrace, Splunk, Grafana, New Relic, Prometheus, or OpenTelemetry is highly desirable.
Familiarity with telemetry pipeline tools such as Kafka, Fluent Bit, OpenTelemetry Collector, or similar technologies is a strong plus.
Experience building reusable consulting assets such as reference architectures, accelerators, and governance frameworks is preferred.

Responsibilities

Lead discovery sessions, architecture workshops, and solution design activities across observability, reliability, telemetry, and operational intelligence programs for enterprise clients.
Design end-to-end observability architectures covering monitoring, logging, metrics, tracing, event correlation, alerting, telemetry pipelines, and platform integrations across hybrid and multi-cloud environments.
Define and enforce enterprise standards for telemetry governance, including naming conventions, tagging, RBAC, data quality, retention, sampling, cost optimization, and service ownership models.
Guide modernization initiatives such as tool consolidation, dashboard and alert rationalization, migration from legacy monitoring systems, and implementation of scalable observability platforms.
Establish and mature SRE practices including SLIs, SLOs, error budgets, production readiness reviews, and incident response frameworks to improve operational reliability.
Design integration patterns across ITSM, CMDB, event management, automation, and incident response platforms to ensure seamless operational workflows.
Support pre-sales and pursuit activities by shaping solution strategy, validating scope, developing estimates, and creating client-facing technical narratives.
Act as a senior escalation point during delivery, providing architecture governance, risk mitigation guidance, and technical oversight across engagements.
Develop reusable assets including reference architectures, playbooks, governance models, and accelerators while mentoring architects, consultants, and delivery teams.
