- Define and drive reliability strategy: establish SLOs and ensure alignment across engineering.
- Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers.
- Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents.
- Architect for reliability and observability: influence system design for redundancy, durability, and debuggability.
- Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection.
- Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service.
- Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights.
- Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems.