Senior Manager, Engineering - SRE Network Operations Center (NOC)
Posted 3 months ago
💎 Seniority level: Senior, 10+ years
📍 Location: United States of America
💸 Salary: $130,000 - $260,000 per year
🔍 Industry: Insurance
🏢 Company: external
🗣️ Languages: English
⏳ Experience: 10+ years
🪄 Skills: Leadership, Python, SQL, Agile, Bash, Cloud Computing, Java, SCRUM, Azure, Grafana, Prometheus, Communication Skills, Networking, Troubleshooting, Scripting
Requirements:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- Cloud Certifications are a plus (preferably Azure or AWS).
- 10+ years of hands-on work experience supervising personnel in a technical environment.
- Excellent troubleshooting skills and a thorough understanding of Operations, Incident Management, systems engineering, and infrastructure support.
- Must have experience leading a team in a fast-paced environment.
- Excellent verbal and written communication skills; technical writing skills required.
- Overall understanding of cloud computing and other internet technologies.
- Ability to facilitate resolution of multiple incidents.
- Understanding of incident management and infrastructure systems/tools.
- Experience with Agile methodologies such as Kanban and Scrum.
- Knowledge of observability tools such as Splunk, Dynatrace, etc.
- Strong IT and network systems troubleshooting skills.
- Good understanding of Cloud Computing technologies and concepts (SaaS, PaaS, IaaS, etc.).
- Azure Fundamentals (AZ-900) Certification is a plus.
- Experience with shell scripting languages and full-stack engineering.
- Strong problem-solving, time management, flexibility, and communication skills.
Responsibilities:
- Define the strategic direction for the NOC with a focus on adopting and embedding SRE practices across all operational processes.
- Lead a team of 15+ Incident Response SRE engineers, providing guidance, mentorship, and support to ensure high performance.
- Serve as the ultimate incident commander during critical incidents, overseeing incident communication and ensuring stakeholder alignment.
- Develop and maintain a robust schedule that ensures 24/7/365 coverage by the NOC SRE team.
- Drive the adoption of SRE practices and implement changes that reduce toil, enhance reliability, and improve incident response.
- Oversee incident communication reports to executives and key stakeholders.
- Oversee the development and maintenance of observability dashboards using Grafana and Prometheus.
- Establish and maintain a robust notification, escalation, and paging process.
- Plan and execute regular simulations and dry runs to build muscle memory for incident response.
- Oversee NOC SRE backlog management, prioritizing improvements for high-impact tasks.