- Monitor critical production systems including Azure Kubernetes Service (AKS), microservices, and CI/CD pipelines.
- Act as the primary technical responder for live production incidents and Slack escalations.
- Perform root-cause identification and resolve production incidents.
- Maintain, refine, and improve internal runbooks and standard operating procedures (SOPs).
- Oversee and support deployment activities across production and non-production environments.
- Collaborate with engineering teams to resolve systemic issues and elevate platform reliability.
- Design and implement automation scripts to reduce manual operational toil.
PythonBashKubernetes+3 more