Incident and Escalation Manager
Based in the United States, Full-Time, Manager
Salary not disclosed
Job Details
- Experience: 12+ years
- Required Skills: Artificial Intelligence, Salesforce, Jira, Slack, Distributed Systems
Requirements
- 12+ years of experience in Incident Management, Escalation Management, Problem Management, or Technical Operations in enterprise or high-tech environments.
- Proven experience leading high-severity incidents and executive escalations in AI, HPC, or large-scale infrastructure ecosystems.
- Strong technical understanding of complex distributed systems and ability to collaborate effectively with engineering teams under pressure.
- Deep knowledge of ITIL frameworks, including Incident, Problem, Change, and Escalation Management practices.
- Exceptional communication skills, with the ability to manage both technical and executive-level audiences.
- Strong analytical mindset with experience interpreting incident data, trends, and operational metrics.
- Ability to operate in high-pressure, customer-facing situations with strong ownership and decision-making capabilities.
- Experience working in global, 24/7 operational environments with on-call responsibilities.
- Proven ability to influence cross-functional teams and senior stakeholders without direct authority.
Responsibilities
- Lead and coordinate major incident response efforts for high-severity service disruptions impacting AI, HPC, and enterprise-scale environments.
- Act as Incident Commander, driving structured triage, cross-functional collaboration, real-time decision-making, and service restoration activities.
- Manage executive-level escalations, ensuring rapid resolution of critical customer issues and maintaining strong stakeholder alignment.
- Provide clear, timely, and structured communication to executives, customers, and internal teams during major incidents.
- Partner with engineering, support, product, and sales teams to resolve complex technical and service-related challenges.
- Lead post-incident and escalation reviews (PIER), including root cause analysis and corrective action tracking.
- Identify systemic issues and drive continuous improvement across incident, escalation, and problem management processes.
- Contribute to the development of operational frameworks, governance models, and service reliability standards across global teams.