- Design and evolve reliability architecture
- Define SLIs/SLOs
- Lead incident response
- Conduct root cause analysis
- Maintain observability systems
- Build automation for deployment safety
- Collaborate with security teams
- Drive operational maturity through runbooks, resilience testing, and capacity planning for mission-critical autonomy and data-intensive workloads
Python, Kubernetes, Go (+2 more)