Staff Site Reliability Engineer

Wand Synthesis AI Inc | AI Infrastructure
Europe Timezone | Full-Time | Staff
Salary not disclosed

Job Details

Required Skills
AWS, Kubernetes, Azure, CI/CD, Terraform, MLOps, Distributed Systems

Requirements

  • Extensive hands-on experience in SRE or Production Engineering roles.
  • Demonstrated experience building or scaling SRE practices in high-growth or complex environments.
  • Deep expertise in AWS or Azure-based cloud infrastructure.
  • Strong experience with Kubernetes (including migration, scaling, and production hardening).
  • Advanced Infrastructure-as-Code experience (Terraform or equivalent).
  • End-to-end CI/CD pipeline design and optimisation experience.
  • Strong experience with observability tooling across distributed systems.
  • Experience troubleshooting complex multi-tenant or customer-hosted environments.
  • Experience supporting production data platforms and ML systems.
  • MLOps experience, including model deployment and monitoring.
  • Strong understanding of distributed systems, scalability, and fault tolerance.

Responsibilities

  • Architect, deploy, and operate scalable, secure production environments (AWS preferred).
  • Lead reliability improvements across multiple engineering streams.
  • Design and evolve Kubernetes-based infrastructure, including migration and optimisation initiatives.
  • Build and enforce strong Infrastructure-as-Code standards.
  • Define and operationalise SLIs, SLOs, and error budgets.
  • Strengthen observability across applications, infrastructure, data pipelines, and ML systems.
  • Work across and optimise the entire CI/CD pipeline.
  • Lead incident response for complex cross-system failures and drive postmortems.
  • Support and productionise ML workloads (MLOps).
  • Mentor engineers and raise the overall reliability bar.