Staff Production Operations Engineer

Based in United States | Full-Time | Staff
Salary: Competitive compensation aligned with senior infrastructure engineering roles in the US market

Job Details

Experience
5+ years
Required Skills
AWS, GCP, Azure, Go, Grafana, DevOps, Datadog

Requirements

  • 5+ years of experience in Site Reliability Engineering, DevOps, or Production Operations in large-scale distributed environments.
  • Strong experience with incident management platforms such as PagerDuty, incident.io, or similar tools.
  • Hands-on expertise with observability stacks including Datadog, Grafana, CloudWatch, Sentry, or equivalents.
  • Strong understanding of reliability engineering principles such as SLOs, SLIs, MTTR, MTTA, and error budgets.
  • Experience building automation, tooling, or systems to reduce operational toil and improve engineering efficiency.
  • Proficiency in Go or another systems programming language with the ability to contribute to production codebases.
  • Familiarity with cloud environments (AWS, Azure, or GCP) and infrastructure-as-code practices.
  • Experience leveraging AI-assisted development tools to improve workflows and operational processes.
  • Strong written communication skills with the ability to coordinate across teams without direct authority.

Responsibilities

  • Drive end-to-end improvements across the incident lifecycle, including alerting quality, severity classification, triage processes, and post-incident follow-ups.
  • Coordinate on-call programs across distributed teams, including scheduling, onboarding, and ensuring consistent operational coverage.
  • Lead incident reviews, identify root causes, and ensure actionable follow-ups are tracked to completion.
  • Build and deploy automation and AI-driven agents to reduce operational toil, including incident summarization and on-call optimization.
  • Maintain and evolve runbooks, playbooks, and operational documentation to reflect current system behavior and best practices.
  • Partner with engineering and product teams to improve system observability, reliability metrics, and operational readiness.
  • Contribute directly to incident resolution when needed by debugging, prototyping fixes, or implementing mitigation strategies.
  • Improve monitoring, alerting, and observability systems to reduce noise and increase signal quality across production environments.