Senior Site Reliability Engineer
Spain / United Kingdom | Full-Time | Senior
Salary not disclosed
Job Details
- Languages: English
- Experience: 5+ years of experience as a Site Reliability Engineer or in a similar role
- Required Skills: AWS, Python, Java, Kubernetes, Go, Prometheus, CI/CD, Terraform, Datadog
Requirements
- Bachelor’s degree in Computer Engineering or a similar discipline.
- 5+ years of experience as a Site Reliability Engineer or in a similar role.
- 3+ years of experience with AWS services, including strong knowledge of container orchestration.
- 2+ years of Kubernetes experience.
- Deep understanding of observability principles and tools such as Prometheus, Datadog, and OpenTelemetry.
- Experience leading incident management and complex postmortem analysis.
- Experience and interest in managing infrastructure as code (Terraform).
- Experience with chaos engineering and other techniques for testing system resilience.
- Experience with CI/CD tools such as GitHub Actions for automated delivery.
- Proficiency in at least one programming language (Python, Go, Java, etc.) for building automation and internal tooling.
- Experience with event-driven architecture (SNS, SQS, etc.).
- Good communication skills and fluency in English.
Responsibilities
- Lead the design of scalable, fault-tolerant and self-healing systems in a multi-region AWS environment.
- Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to drive architectural decisions and error budget policies.
- Conduct blameless post-incident reviews to uncover systemic root causes and implement long-term preventive measures.
- Identify patterns of manual work and lead the development of internal tools/automation to permanently eliminate them.
- Develop and maintain automated runbooks and playbooks for common operational tasks and complex incident response.
- Shift from simple monitoring to deep observability, ensuring high-cardinality data leads to proactive, actionable insights.
- Proactively identify and mitigate operational risks through chaos engineering and architecture reviews.
- Work with software engineers to design systems for reliability, scalability, and maintainability from the early stages of the SDLC.
- Continuously evaluate and optimize system performance, capacity, and cost efficiency.
- Refine the on-call experience to reduce alert fatigue, improve MTTR, and ensure sustainable rotation health.