- Define and enforce SLOs, SLIs, and SLAs across production
- Monitor system health, plan capacity, and automate deployments, patching, and infrastructure provisioning
- Diagnose and resolve production incidents quickly, including participation in a 24/7 on-call rotation
- Lead post-mortems and turn findings into prevention; maintain runbooks and escalation procedures
- Manage cloud infrastructure via IaC and own CI/CD pipeline design and maintenance
- Drive scalability, fault tolerance, disaster recovery, and security compliance
- Partner with development teams on production readiness, shift-left practices, and error budget management
Kubernetes, Go, CI/CD, +2 more