Architect and operate resilient cloud infrastructure (AWS, Pulumi, Kubernetes)
Lead initiatives to improve availability, latency, and performance at scale
Design and evolve CI/CD pipelines
Define metrics, alerts, and runbooks for observability
Run chaos experiments and failure simulations
Mentor engineers and set SRE best practices