Senior Site Reliability Engineer
In the United Kingdom... Possibility to work remotely from locations within the European Union depending on team arrangements.
Full-Time
Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: AWS, Python, GCP, Kubernetes, Azure, Go, CI/CD, Terraform
Requirements
- 5+ years of hands-on experience in Site Reliability Engineering, Platform Engineering, DevOps, Cloud Infrastructure, or similar infrastructure-focused engineering roles.
- Proven expertise operating and scaling high-throughput, highly available production systems.
- Deep practical experience with Kubernetes in cloud environments such as Azure, AWS, or GCP.
- Strong understanding of observability concepts, including monitoring, SLIs, SLOs, error budgets, logging, and distributed tracing.
- Proficiency in Go or Python, with strong software engineering and automation skills.
- Experience with Infrastructure as Code tools such as Pulumi, Terraform, or OpenTofu, along with GitOps workflows and CI/CD automation.
- Strong knowledge of cloud-native technologies, distributed systems, and reliability engineering best practices.
- Demonstrated experience leading infrastructure initiatives, writing technical proposals, and driving architecture decisions.
- Strong communication skills with the ability to collaborate effectively across technical teams and stakeholders.
- Comfortable participating in on-call rotations and managing critical production incidents.
Responsibilities
- Drive the architecture and evolution of scalable cloud infrastructure and Kubernetes environments designed for high availability and global growth.
- Define and implement platform reliability strategies, including zero-downtime deployments, disaster recovery, rollback mechanisms, and resilience improvements.
- Improve and maintain observability systems, monitoring frameworks, and telemetry infrastructure to support operational excellence and system transparency.
- Build and optimize Infrastructure as Code and self-service platform capabilities to reduce operational overhead and improve developer experience.
- Lead platform-related incident response activities, conduct blameless post-mortems, and implement long-term systemic improvements.
- Collaborate closely with engineering teams to define technical roadmaps, architecture standards, and scalable operational practices.
- Mentor and support teammates through technical guidance, design reviews, and knowledge sharing initiatives.
- Drive continuous improvement in CI/CD pipelines, GitOps workflows, automation strategies, and cloud-native infrastructure operations.