- Lead critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
- Serve as Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
- Act as the liaison between leadership and external teams during incidents.
- Own the incident response lifecycle from triage to post-incident review.
- Participate in an on-call rotation to respond to and coordinate incidents.
- Lead post-incident reviews (PIRs) and root cause analysis.
- Track incident metrics including MTTR, MTTD, and incident recurrence rates.
- Contribute to runbooks, operational standards, and reliability frameworks.