Senior Incident Manager

LambdaAI Cloud Infrastructure

Remote, USA | Full-Time | Senior

Salary: 125,000 - 195,000 USD per year


Job Details

Requirements

8+ years of experience in incident management, site reliability engineering, or infrastructure operations.
Experience managing incidents in large-scale distributed infrastructure environments.
Deep understanding of data center operations, GPU compute clusters, networking, and storage infrastructure.
Proven ability to lead high-pressure incident response situations.
Experience with incident management frameworks (ITIL, SRE, or equivalent).
Excellent communication and stakeholder management skills.
Proficiency with incident tracking and monitoring tools (PagerDuty, ServiceNow, Jira, Datadog, Prometheus, Grafana).

Responsibilities

Lead critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
Serve as Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
Act as the liaison between leadership and external teams during incidents.
Own the incident response lifecycle from triage to post-incident review.
Participate in an on-call rotation to respond to and coordinate incidents.
Lead post-incident reviews (PIRs) and root cause analysis.
Track incident metrics including MTTR, MTTD, and incident recurrence rates.
Contribute to runbooks, operational standards, and reliability frameworks.
