Site Reliability Engineering Manager

Posted 4 days ago

💎 Seniority level: Manager

📍 Location: United Kingdom

🗣️ Languages: English

🪄 Skills: AWS, Backend Development, Docker, PostgreSQL, Python, SQL, Cloud Computing, GCP, Java, Java EE, Jenkins, Kafka, Kubernetes, Spring Boot, Spring MVC, Zabbix, Algorithms, Azure, Data Structures, Go, Grafana, Java Spring, Prometheus, RDBMS, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Networking, Ansible, Scripting, Debugging

Requirements:

Proficiency in at least one programming language (e.g., Python, Go, Java) and familiarity with multiple language ecosystems.
Solid understanding of operating systems, networking, distributed systems, databases, and storage architectures.
Deep understanding of how code runs on underlying hardware, including operating systems, algorithms, and data structures. Ability to optimize or troubleshoot code by understanding its execution and the impact on system resources.
Experience handling production incidents, including root cause analysis, mitigation, and working through complex system failures.
Strong communication skills, with an ability to explain technical concepts to both engineering and business stakeholders. Commitment to collaborative problem-solving and shared ownership of services.
Proven experience in automating manual processes, building deployment pipelines, or managing configuration systems.

Responsibilities:

Develop tools and software to automate operational processes, improve system reliability, and reduce manual intervention.
Lead, implement, and improve monitoring and observability frameworks, enabling proactive detection and resolution of incidents.
Participate in an on-call rotation to diagnose, troubleshoot, and mitigate production incidents, ensuring minimal downtime and swift resolution.
Work alongside developers to ensure the quality, scalability, and reliability of our services. Practice shared ownership of services in production, fostering a "You build it, you run it" culture.
Manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to set and maintain reliability expectations effectively.
Apply a strong understanding of common application reliability patterns, drawing on hands-on experience implementing them.
Conduct deep-dive analyses of incidents and collaborate on post-incident reviews to derive learnings and prevent recurrence. Champion a culture of continuous improvement.
Evaluate system performance and advocate for optimisations that reduce infrastructure costs while maintaining service reliability.
