Staff Site Reliability Engineer
Wand Synthesis AI Inc, AI Infrastructure
Europe Timezone, Full-Time, Staff
Salary not disclosed
Job Details
Required Skills
- AWS, Kubernetes, Azure, CI/CD, Terraform, MLOps, Distributed Systems
Requirements
- Extensive hands-on experience in SRE or Production Engineering roles.
- Demonstrated experience building or scaling SRE practices in high-growth or complex environments.
- Deep expertise in AWS or Azure-based cloud infrastructure.
- Strong experience with Kubernetes (including migration, scaling, and production hardening).
- Advanced Infrastructure-as-Code experience (Terraform or equivalent).
- End-to-end CI/CD pipeline design and optimisation experience.
- Strong experience with observability tooling across distributed systems.
- Experience troubleshooting complex multi-tenant or customer-hosted environments.
- Experience supporting production data platforms and ML systems.
- MLOps experience, including model deployment and monitoring.
- Strong understanding of distributed systems, scalability, and fault tolerance.
Responsibilities
- Architect, deploy, and operate scalable, secure production environments (AWS preferred).
- Lead reliability improvements across multiple engineering streams.
- Design and evolve Kubernetes-based infrastructure, including migration and optimisation initiatives.
- Build and enforce strong Infrastructure-as-Code standards.
- Define and operationalise SLIs, SLOs, and error budgets.
- Strengthen observability across applications, infrastructure, data pipelines, and ML systems.
- Work across and optimise the entire CI/CD pipeline.
- Lead incident response for complex cross-system failures and drive postmortems.
- Support and productionise ML workloads (MLOps).
- Mentor engineers and raise the overall reliability bar.