Senior Manager, Engineering - SRE Network Operations Center (NOC)

Posted 3 months ago

💎 Seniority level: Senior, 10+ years

📍 Location: United States of America

💸 Salary: $130,000 - $260,000 per year

🔍 Industry: Insurance

🏢 Company: external

🗣️ Languages: English

⏳ Experience: 10+ years

🪄 Skills: Leadership, Python, SQL, Agile, Bash, Cloud Computing, Java, SCRUM, Azure, Grafana, Prometheus, Communication Skills, Networking, Troubleshooting, Scripting

Requirements:
  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • Cloud Certifications are a plus (preferably Azure or AWS).
  • 10+ years of hands-on work experience supervising personnel in a technical environment.
  • Excellent troubleshooting skills and a thorough understanding of Operations, Incident Management, systems engineering, and infrastructure support.
  • Must have experience leading a team in a fast-paced environment.
  • Excellent verbal and written communication skills; technical writing skills required.
  • Overall understanding of cloud computing and other internet technologies.
  • Ability to facilitate resolution of multiple incidents.
  • Understanding of incident management and infrastructure systems/tools.
  • Experience with Agile methodologies such as Kanban and Scrum.
  • Knowledge of observability tools such as Splunk, Dynatrace, etc.
  • Strong IT and network systems troubleshooting skills.
  • Good understanding of Cloud Computing technologies and concepts (SaaS, PaaS, IaaS, etc.).
  • Azure Fundamentals (AZ-900) Certification is a plus.
  • Experience with shell-scripting languages and full stack engineering.
  • Strong problem-solving, time management, flexibility, and communication skills.
Responsibilities:
  • Define the strategic direction for the NOC with a focus on adopting and embedding SRE practices across all operational processes.
  • Lead a team of 15+ Incident Response SRE engineers, providing guidance, mentorship, and support to ensure high performance.
  • Serve as the ultimate incident commander during critical incidents, overseeing incident communication and ensuring stakeholder alignment.
  • Develop and maintain a robust schedule that ensures 24/7/365 coverage by the NOC SRE team.
  • Drive the adoption of SRE practices and implement changes that reduce toil, enhance reliability, and improve incident response.
  • Oversee incident communication reports to executives and key stakeholders.
  • Oversee the development and maintenance of observability dashboards using Grafana and Prometheus.
  • Establish and maintain a robust notification, escalation, and paging process.
  • Plan and execute regular simulations and dry runs to build muscle memory for incident response.
  • Oversee NOC SRE backlog management, prioritizing high-impact improvements.