Principal Observability & Reliability Architect
Based in United States | Full-Time | Principal
Salary: Competitive On-Target Earnings (OTE) package including base salary and performance-based incentives, determined by experience and location.
Job Details
- Experience: 10+ years of experience in observability, platform operations, SRE, monitoring, APM, or related enterprise infrastructure domains, including 5+ years in architecture or technical leadership roles.
- Required Skills: ServiceNow, Datadog
Requirements
- Strong hands-on expertise designing and implementing observability solutions across metrics, logs, traces, telemetry pipelines, and distributed systems in cloud and hybrid environments.
- Deep understanding of telemetry governance frameworks, including data normalization, enrichment, routing, retention strategies, access control, and cost optimization.
- Proven ability to define enterprise standards for dashboards, alerts, service tagging, naming conventions, RBAC, and operational maturity models.
- Strong SRE background with practical experience implementing SLIs, SLOs, error budgets, incident response processes, and production reliability practices.
- Experience integrating observability platforms with ITSM and operational tools such as ServiceNow, PagerDuty, Jira Service Management, or similar ecosystems.
- Consulting or professional services experience with strong client-facing communication, workshop facilitation, estimation, and cross-functional leadership skills.
- Ability to translate complex technical challenges into clear, actionable architecture and delivery plans for both technical and executive audiences.
- Experience with platforms such as Datadog, Dynatrace, Splunk, Grafana, New Relic, Prometheus, or OpenTelemetry is highly desirable.
- Familiarity with telemetry pipeline tools such as Kafka, Fluent Bit, OpenTelemetry Collector, or similar technologies is a strong plus.
- Experience building reusable consulting assets such as reference architectures, accelerators, and governance frameworks is preferred.
Responsibilities
- Lead discovery sessions, architecture workshops, and solution design activities across observability, reliability, telemetry, and operational intelligence programs for enterprise clients.
- Design end-to-end observability architectures covering monitoring, logging, metrics, tracing, event correlation, alerting, telemetry pipelines, and platform integrations across hybrid and multi-cloud environments.
- Define and enforce enterprise standards for telemetry governance, including naming conventions, tagging, RBAC, data quality, retention, sampling, cost optimization, and service ownership models.
- Guide modernization initiatives such as tool consolidation, dashboard and alert rationalization, migration from legacy monitoring systems, and implementation of scalable observability platforms.
- Establish and mature SRE practices including SLIs, SLOs, error budgets, production readiness reviews, and incident response frameworks to improve operational reliability.
- Design integration patterns across ITSM, CMDB, event management, automation, and incident response platforms to ensure seamless operational workflows.
- Support pre-sales and pursuit activities by shaping solution strategy, validating scope, developing estimates, and creating client-facing technical narratives.
- Act as a senior escalation point during delivery, providing architecture governance, risk mitigation guidance, and technical oversight across engagements.
- Develop reusable assets including reference architectures, playbooks, governance models, and accelerators while mentoring architects, consultants, and delivery teams.