Senior Site Reliability Engineer

UK / Europe | Contract | Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
AWS, Python, Kubernetes, Go, Grafana, Prometheus, CI/CD, Terraform, CloudFormation

Requirements

  • 5+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or similar infrastructure-focused roles.
  • Strong hands-on expertise with AWS cloud services; experience with Azure or GCP is considered an advantage.
  • Proven experience using Infrastructure as Code tools such as Terraform or CloudFormation.
  • Solid understanding of CI/CD pipelines, automation practices, and Git-based development workflows.
  • Experience implementing and managing reliability frameworks including SLIs, SLOs, SLAs, and error budgets.
  • Practical knowledge of observability and monitoring tools such as Prometheus, Grafana, ELK/EFK, OpenTelemetry, and distributed tracing solutions.
  • Scripting or programming skills in Python, Go, Bash, or PowerShell.
  • Strong understanding of networking concepts including VPCs, VPNs, load balancers, and firewalls.
  • Familiarity with cloud security principles, compliance frameworks, and operational best practices.
  • Excellent troubleshooting, communication, and stakeholder management skills within global and cross-functional environments.

Responsibilities

  • Lead site reliability and operational excellence initiatives across production systems and cloud-based services.
  • Define, implement, and manage reliability metrics including SLIs, SLOs, SLAs, and error budgets to ensure platform stability and performance.
  • Design and maintain scalable, resilient cloud-native architectures with a strong focus on automation and infrastructure reliability.
  • Build and optimize Infrastructure as Code and CI/CD pipelines to improve deployment efficiency and consistency.
  • Develop and maintain monitoring, logging, tracing, and alerting capabilities to enhance system observability and proactive incident response.
  • Drive incident management processes, including troubleshooting, root cause analysis, post-incident reviews, and preventive improvements.
  • Collaborate with cross-functional global teams across engineering, security, product, and vendor management functions.
  • Support operational maturity initiatives through documentation, runbooks, automation, and continuous process optimization.
  • Mentor engineers and contribute to technical knowledge sharing and reliability best practices across teams.
  • Perform capacity planning, system performance analysis, and reliability assessments to ensure long-term scalability.