Incident and Escalation Manager

Based in the United States | Full-Time | Manager

Salary not disclosed

Job Details

Experience: 12+ years
Required Skills: Artificial Intelligence, Salesforce, Jira, Slack, Distributed Systems

12+ years of experience in Incident Management, Escalation Management, Problem Management, or Technical Operations in enterprise or high-tech environments.
Proven experience leading high-severity incidents and executive escalations in AI, HPC, or large-scale infrastructure ecosystems.
Strong technical understanding of complex distributed systems and ability to collaborate effectively with engineering teams under pressure.
Deep knowledge of ITIL frameworks, including Incident, Problem, Change, and Escalation Management practices.
Exceptional communication skills, with the ability to manage both technical and executive-level audiences.
Strong analytical mindset with experience interpreting incident data, trends, and operational metrics.
Ability to operate in high-pressure, customer-facing situations with strong ownership and decision-making capabilities.
Experience working in global, 24/7 operational environments with on-call responsibilities.
Proven ability to influence cross-functional teams and senior stakeholders without direct authority.

Lead and coordinate major incident response efforts for high-severity service disruptions impacting AI, HPC, and enterprise-scale environments.
Act as Incident Commander, driving structured triage, cross-functional collaboration, real-time decision-making, and service restoration activities.
Manage executive-level escalations, ensuring rapid resolution of critical customer issues and maintaining strong stakeholder alignment.
Provide clear, timely, and structured communication to executives, customers, and internal teams during major incidents.
Partner with engineering, support, product, and sales teams to resolve complex technical and service-related challenges.
Lead post-incident and escalation reviews (PIER), including root cause analysis and corrective action tracking.
Identify systemic issues and drive continuous improvement across incident, escalation, and problem management processes.
Contribute to the development of operational frameworks, governance models, and service reliability standards across global teams.
