Principal Observability & Reliability Architect
United States, Full-Time, Principal
Salary: 180,000 - 240,000 USD per year
Job Details
- Experience: 10+ years in observability, monitoring, APM, platform operations, SRE, or related enterprise technology domains, including 5+ years leading architecture and delivery strategy.
- Required Skills: Cloud Computing, Kubernetes, CI/CD
Requirements
- 10+ years in observability, monitoring, APM, platform operations, SRE, or related enterprise technology domains.
- 5+ years leading architecture and delivery strategy for enterprise observability or reliability initiatives.
- Hands-on experience designing and implementing solutions across monitoring, logging, metrics, tracing, telemetry collection, and pipeline patterns in hybrid and multi-cloud environments.
- Expertise in telemetry governance: routing, transformation, normalization, enrichment, retention, access control, and cost management.
- Experience defining enterprise standards for dashboards, alerts, tagging, naming, service ownership, RBAC, and operating model adoption.
- Strong command of incident response, event correlation, alert strategy, service health, and business-service visibility.
- Applied knowledge of SRE concepts including SLIs, SLOs, error budgets, and production readiness.
- Ability to lead executive and technical workshops and translate business needs into actionable architecture and delivery plans.
- Consulting or professional services experience with strong client-facing communication, estimation, risk management, and cross-functional leadership.
Responsibilities
- Lead client discovery, architecture workshops, and solution design across observability, telemetry, reliability, and operational intelligence initiatives.
- Design enterprise observability architectures spanning monitoring, logging, metrics, tracing, telemetry pipelines, alerting, event correlation, service visibility, and platform integrations.
- Define scalable standards for telemetry onboarding, naming, tagging, RBAC, service ownership, dashboards, alert governance, runbooks, and operational handoff.
- Advise on telemetry governance, including data quality, retention, access control, sampling, cardinality, and cost optimization.
- Lead modernization initiatives including tool rationalization, dashboard and alert rationalization, telemetry strategy, and migration from legacy monitoring platforms.
- Guide SRE practices including SLIs, SLOs, error budgets, production readiness, and incident response maturity.
- Design integration patterns across ITSM, CMDB, event management, and automation platforms.
- Support pursuits by shaping solution strategy, validating scope, informing estimates, and building client-facing technical narratives.
- Serve as a senior escalation point and provide architecture governance during delivery.
- Build reusable reference architectures, playbooks, and accelerators while mentoring architects, consultants, and offshore teams.