Senior Site Reliability Engineer

UK / Europe, Contract, Senior

Salary not disclosed

Job Details

Experience: 5+ years
Required Skills: AWS, Python, Kubernetes, Go, Grafana, Prometheus, CI/CD, Terraform, CloudFormation

Requirements

5+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or similar infrastructure-focused roles.
Strong hands-on expertise with AWS cloud services; experience with Azure or GCP is considered an advantage.
Proven experience using Infrastructure as Code tools such as Terraform or CloudFormation.
Solid understanding of CI/CD pipelines, automation practices, and Git-based development workflows.
Experience implementing and managing reliability frameworks including SLIs, SLOs, SLAs, and error budgets.
Practical knowledge of observability and monitoring tools such as Prometheus, Grafana, ELK/EFK, OpenTelemetry, and distributed tracing solutions.
Scripting or programming skills in Python, Go, Bash, or PowerShell.
Strong understanding of networking concepts including VPCs, VPNs, load balancers, and firewalls.
Familiarity with cloud security principles, compliance frameworks, and operational best practices.
Excellent troubleshooting, communication, and stakeholder management skills within global and cross-functional environments.

Responsibilities

Lead site reliability and operational excellence initiatives across production systems and cloud-based services.
Define, implement, and manage reliability metrics including SLIs, SLOs, SLAs, and error budgets to ensure platform stability and performance.
Design and maintain scalable, resilient cloud-native architectures with a strong focus on automation and infrastructure reliability.
Build and optimize Infrastructure as Code and CI/CD pipelines to improve deployment efficiency and consistency.
Develop and maintain monitoring, logging, tracing, and alerting capabilities to enhance system observability and proactive incident response.
Drive incident management processes, including troubleshooting, root cause analysis, post-incident reviews, and preventive improvements.
Collaborate with cross-functional global teams across engineering, security, product, and vendor management functions.
Support operational maturity initiatives through documentation, runbooks, automation, and continuous process optimization.
Mentor engineers and contribute to technical knowledge sharing and reliability best practices across teams.
Perform capacity planning, system performance analysis, and reliability assessments to ensure long-term scalability.
