Senior Site Reliability Engineer

Spain / United Kingdom | Full-Time | Senior

Salary not disclosed

Job Details

Languages: English
Experience: 5+ years of experience as a Site Reliability Engineer or in a similar role
Required Skills: AWS, Python, Java, Kubernetes, Go, Prometheus, CI/CD, Terraform, Datadog

Bachelor’s degree in Computer Engineering or a similar discipline.
5+ years of experience as a Site Reliability Engineer or in a similar role.
3+ years of experience with AWS services, including strong knowledge of container orchestration.
2+ years of Kubernetes experience.
Deep understanding of observability principles and tools such as Prometheus, Datadog, and OpenTelemetry.
Experience leading incident management and complex postmortem analysis.
Experience and interest in managing infrastructure as code (Terraform).
Experience with chaos engineering and other techniques for testing system resilience.
Experience with CI/CD tools such as GitHub Actions for automated delivery.
Proficiency in at least one programming language (Python, Go, Java, etc.) for building automation and internal tooling.
Event-driven architecture experience (SNS, SQS, etc.).
Good communication skills and fluency in English.

Lead the design of scalable, fault-tolerant and self-healing systems in a multi-region AWS environment.
Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to drive architectural decisions and error budget policies.
Conduct blameless post-incident reviews to uncover systemic root causes and implement long-term preventive measures.
Identify patterns of manual work and lead the development of internal tools/automation to permanently eliminate them.
Develop and maintain automated runbooks and playbooks for common operational tasks and complex incident response.
Shift from simple monitoring to deep observability, ensuring high-cardinality data leads to proactive, actionable insights.
Proactively identify and mitigate operational risks through chaos engineering and architecture reviews.
Work with software engineers to design systems for reliability, scalability, and maintainability from the early stages of the SDLC.
Continuously evaluate and optimize system performance, capacity, and cost efficiency.
Refine the on-call experience to reduce alert fatigue, improve MTTR, and ensure sustainable rotation health.
