Staff Production Operations Engineer

Based in United States | Full-Time | Staff

Salary: Competitive compensation aligned with senior infrastructure engineering roles in the US market

Job Details

Requirements

5+ years of experience in Site Reliability Engineering, DevOps, or Production Operations in large-scale distributed environments.
Strong experience with incident management platforms such as PagerDuty, incident.io, or similar tools.
Hands-on expertise with observability stacks including Datadog, Grafana, CloudWatch, Sentry, or equivalents.
Strong understanding of reliability engineering principles such as SLOs, SLIs, MTTR, MTTA, and error budgets.
Experience building automation, tooling, or systems to reduce operational toil and improve engineering efficiency.
Proficiency in Go or another systems programming language with the ability to contribute to production codebases.
Familiarity with cloud environments (AWS, Azure, or GCP) and infrastructure-as-code practices.
Experience leveraging AI-assisted development tools to improve workflows and operational processes.
Strong written communication skills with the ability to coordinate across teams without direct authority.

Responsibilities

Drive end-to-end improvements across the incident lifecycle, including alerting quality, severity classification, triage processes, and post-incident follow-ups.
Coordinate on-call programs across distributed teams, including scheduling, onboarding, and ensuring consistent operational coverage.
Lead incident reviews, identify root causes, and ensure actionable follow-ups are tracked and completed effectively.
Build and deploy automation and AI-driven agents to reduce operational toil, including incident summarization and on-call optimization.
Maintain and evolve runbooks, playbooks, and operational documentation to reflect current system behavior and best practices.
Partner with engineering and product teams to improve system observability, reliability metrics, and operational readiness.
Contribute directly to incident resolution when needed by debugging, prototyping fixes, or implementing mitigation strategies.
Improve monitoring, alerting, and observability systems to reduce noise and increase signal quality across production environments.