- Take end-to-end responsibility for critical reliability areas and drive technical direction.
- Drive the architecture and evolution of Azure cloud infrastructure and Kubernetes clusters.
- Define resilience strategies including global scaling, zero-downtime deployments, and disaster recovery.
- Optimize the LGTM observability stack (Loki, Grafana, Tempo, Mimir).
- Improve the IaC platform and enable self-service for engineering teams.
- Lead incident response and conduct blameless post-mortems.
- Mentor team members and lead RFCs and design reviews.
Python, Kubernetes, Azure, +2 more