Design, implement, and manage scalable systems that ensure high availability, fault tolerance, and optimal performance.
Continuously monitor and improve system health and performance through data analysis and metrics.
Embed observability (metrics, logs, traces, alerts) with actionable thresholds and up-to-date runbooks.
Eliminate toil by building automation and self-service tools for common operational workflows.
Own CI/CD pipelines (build, test, security scans) and enable progressive delivery (blue/green, canary).
Manage infrastructure as code (Terraform) and configuration management through Git-backed workflows.
Participate in on-call; triage, mitigate, and resolve incidents within defined SLAs.
Lead incident response and blameless post-incident reviews; document RCAs and drive corrective actions to closure.
Maintain runbooks/playbooks and regularly run disaster recovery exercises.
Operate and secure AWS environments (IAM, VPC, EC2/ECS, RDS, S3, Lambda, etc.) with a focus on resilience and compliance.
Optimize cost, performance, and reliability (rightsizing, autoscaling, reservations/savings plans, tagging, spend monitoring, etc.).
Serve as a technical advisor to engineering teams on infrastructure and operations best practices.
Mentor peers on SRE practices; promote observability, continuous improvement, and a blameless culture.
Contribute to roadmaps and capacity planning to align reliability goals with product objectives.