Staff Production Operations Engineer
Based in United States | Full-Time | Staff
Salary: Competitive compensation aligned with senior infrastructure engineering roles in the US market
Job Details
- Experience: 5+ years
- Required Skills: AWS, GCP, Azure, Go, Grafana, DevOps, Datadog
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or Production Operations in large-scale distributed environments.
- Strong experience with incident management platforms such as PagerDuty, incident.io, or similar tools.
- Hands-on expertise with observability stacks including Datadog, Grafana, CloudWatch, Sentry, or equivalents.
- Strong understanding of reliability engineering principles such as SLOs, SLIs, MTTR, MTTA, and error budgets.
- Experience building automation, tooling, or systems to reduce operational toil and improve engineering efficiency.
- Proficiency in Go or another systems programming language with the ability to contribute to production codebases.
- Familiarity with cloud environments (AWS, Azure, or GCP) and infrastructure-as-code practices.
- Experience leveraging AI-assisted development tools to improve workflows and operational processes.
- Strong written communication skills with the ability to coordinate across teams without direct authority.
Responsibilities
- Drive end-to-end improvements across the incident lifecycle, including alerting quality, severity classification, triage processes, and post-incident follow-ups.
- Coordinate on-call programs across distributed teams, including scheduling, onboarding, and ensuring consistent operational coverage.
- Lead incident reviews, identify root causes, and ensure actionable follow-ups are tracked and completed effectively.
- Build and deploy automation and AI-driven agents to reduce operational toil, including incident summarization and on-call optimization.
- Maintain and evolve runbooks, playbooks, and operational documentation to reflect current system behavior and best practices.
- Partner with engineering and product teams to improve system observability, reliability metrics, and operational readiness.
- Contribute directly to incident resolution when needed by debugging, prototyping fixes, or implementing mitigation strategies.
- Improve monitoring, alerting, and observability systems to reduce noise and increase signal quality across production environments.