Senior Site Reliability Engineer

Spain / United Kingdom | Full-Time | Senior
Salary not disclosed

Job Details

Languages
English
Experience
5+ years of experience as a Site Reliability Engineer or in a similar role
Required Skills
AWS, Python, Java, Kubernetes, Go, Prometheus, CI/CD, Terraform, Datadog

Requirements

  • Bachelor’s degree in Computer Engineering or a similar discipline.
  • 5+ years of experience as a Site Reliability Engineer or in a similar role.
  • 3+ years of experience with AWS services, including strong knowledge of container orchestration.
  • 2+ years of Kubernetes experience.
  • Deep understanding of observability principles and tools such as Prometheus, Datadog, and OpenTelemetry.
  • Experience leading incident management and complex postmortem analysis.
  • Experience and interest in managing infrastructure as code (Terraform).
  • Experience with chaos engineering and other techniques for testing system resilience.
  • Experience with CI/CD tools such as GitHub Actions for automated delivery.
  • Proficiency in at least one programming language (Python, Go, Java, etc.) for building automation and internal tooling.
  • Experience with event-driven architectures (SNS, SQS, etc.).
  • Good communication skills and fluency in English.

Responsibilities

  • Lead the design of scalable, fault-tolerant and self-healing systems in a multi-region AWS environment.
  • Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to drive architectural decisions and error budget policies.
  • Conduct blameless post-incident reviews to uncover systemic root causes and implement long-term preventive measures.
  • Identify patterns of manual work and lead the development of internal tools/automation to permanently eliminate them.
  • Develop and maintain automated runbooks and playbooks for common operational tasks and complex incident response.
  • Shift from simple monitoring to deep observability, ensuring high-cardinality data leads to proactive, actionable insights.
  • Proactively identify and mitigate operational risks through chaos engineering and architecture reviews.
  • Work with software engineers to design systems for reliability, scalability, and maintainability from the early stages of the SDLC.
  • Continuously evaluate and optimize system performance, capacity, and cost efficiency.
  • Refine the on-call experience to reduce alert fatigue, improve MTTR, and ensure sustainable rotation health.