- Build and maintain observability across our platform in Datadog: dashboards, monitors, APM, log pipelines, and meaningful, low-noise alerting.
- Define and track SLIs, SLOs, and error budgets for specific services.
- Participate in the on-call rotation and serve as an SRE Partner during incidents.
- Drive incident response per our framework, keeping clear, real-time documentation of status, findings, and decisions.
- Contribute actively to blameless post-incident reviews (RCAs).
- Automate toil by building scripts, tooling, and self-healing mechanisms.
- Use AI tools (e.g., Claude, Cursor) to speed up debugging, runbook maintenance, and RCA drafting.
- Maintain and improve SRE runbooks and triage workflows.
AWS, Python, Bash, +3 more