Senior Incident Manager
LambdaAI Cloud Infrastructure
Remote, USA | Full-Time | Senior
Salary: 125,000 - 195,000 USD per year
Job Details
- Experience: 8+ years
- Required Skills: Jira, Prometheus, Networking, ServiceNow, Datadog
Requirements
- 8+ years of experience in incident management, site reliability engineering, or infrastructure operations.
- Experience managing incidents in large-scale distributed infrastructure environments.
- Deep understanding of data center operations, GPU compute clusters, networking, and storage infrastructure.
- Proven ability to lead high-pressure incident response situations.
- Experience with incident management frameworks (ITIL, SRE, or equivalent).
- Excellent communication and stakeholder management skills.
- Proficiency with incident tracking and monitoring tools (PagerDuty, ServiceNow, Jira, Datadog, Prometheus, Grafana).
Responsibilities
- Lead critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
- Serve as Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
- Act as the liaison between leadership and external teams during incidents.
- Own the incident response lifecycle from triage to post-incident review.
- Participate in an on-call rotation to respond to and coordinate incidents.
- Lead post-incident reviews (PIRs) and root cause analysis.
- Track incident metrics including MTTR, MTTD, and incident recurrence rates.
- Contribute to runbooks, operational standards, and reliability frameworks.