Senior Site Reliability Engineer
UK / Europe | Contract | Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: AWS, Python, Kubernetes, Go, Grafana, Prometheus, CI/CD, Terraform, CloudFormation
Requirements
- 5+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or similar infrastructure-focused roles.
- Strong hands-on expertise with AWS cloud services; experience with Azure or GCP is considered an advantage.
- Proven experience using Infrastructure as Code tools such as Terraform or CloudFormation.
- Solid understanding of CI/CD pipelines, automation practices, and Git-based development workflows.
- Experience implementing and managing reliability frameworks including SLIs, SLOs, SLAs, and error budgets.
- Practical knowledge of observability and monitoring tools such as Prometheus, Grafana, ELK/EFK, OpenTelemetry, and distributed tracing solutions.
- Scripting or programming skills in Python, Go, Bash, or PowerShell.
- Strong understanding of networking concepts including VPCs, VPNs, load balancers, and firewalls.
- Familiarity with cloud security principles, compliance frameworks, and operational best practices.
- Excellent troubleshooting, communication, and stakeholder management skills within global and cross-functional environments.
Responsibilities
- Lead site reliability and operational excellence initiatives across production systems and cloud-based services.
- Define, implement, and manage reliability metrics including SLIs, SLOs, SLAs, and error budgets to ensure platform stability and performance.
- Design and maintain scalable, resilient cloud-native architectures with a strong focus on automation and infrastructure reliability.
- Build and optimize Infrastructure as Code and CI/CD pipelines to improve deployment efficiency and consistency.
- Develop and maintain monitoring, logging, tracing, and alerting capabilities to enhance system observability and proactive incident response.
- Drive incident management processes, including troubleshooting, root cause analysis, post-incident reviews, and preventive improvements.
- Collaborate with cross-functional global teams across engineering, security, product, and vendor management functions.
- Support operational maturity initiatives through documentation, runbooks, automation, and continuous process optimization.
- Mentor engineers and contribute to technical knowledge sharing and reliability best practices across teams.
- Perform capacity planning, system performance analysis, and reliability assessments to ensure long-term scalability.