Lambda

Private Company

Open Positions (2)

Remote, USA | Full-Time | AI Cloud Infrastructure
  • Lead critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
  • Serve as Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
  • Act as the liaison between leadership and external teams during incidents.
  • Own the incident response lifecycle from triage to post-incident review.
  • Participate in an on-call rotation to respond to and coordinate incidents.
  • Lead post-incident reviews (PIRs) and root cause analysis.
  • Track incident metrics including MTTR, MTTD, and incident recurrence rates.
  • Contribute to runbooks, operational standards, and reliability frameworks.
Jira, Prometheus, Networking, +2 more
Showing 1 of 2 positions
