Senior Incident Manager

LambdaAI Cloud Infrastructure
Remote, USA | Full-Time | Senior
Salary
125,000 - 195,000 USD per year

Job Details

Experience
8+ years
Required Skills
Jira, Prometheus, Networking, ServiceNow, Datadog

Requirements

  • 8+ years of experience in incident management, site reliability engineering, or infrastructure operations.
  • Experience managing incidents in large-scale distributed infrastructure environments.
  • Deep understanding of data center operations, GPU compute clusters, networking, and storage infrastructure.
  • Proven ability to lead high-pressure incident response situations.
  • Experience with incident management frameworks (ITIL, SRE, or equivalent).
  • Excellent communication and stakeholder management skills.
  • Proficiency with incident tracking and monitoring tools (PagerDuty, ServiceNow, Jira, Datadog, Prometheus, Grafana).

Responsibilities

  • Lead critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
  • Serve as Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
  • Act as the liaison between leadership and external teams during incidents.
  • Own the incident response lifecycle from triage to post-incident review.
  • Participate in an on-call rotation to respond to and coordinate incidents.
  • Lead post-incident reviews (PIRs) and root cause analysis.
  • Track incident metrics including MTTR, MTTD, and incident recurrence rates.
  • Contribute to runbooks, operational standards, and reliability frameworks.