Site Reliability Engineer

Duvo Inc | AI Operations Platform
EU/UK Based | Full-Time | Middle
Salary: 110,000 - 220,000 EUR per year

Job Details

Required Skills
Docker, GCP, Kubernetes, Grafana, Prometheus, Terraform, Distributed Systems

Requirements

  • Extensive experience designing and operating large-scale distributed systems.
  • Solid understanding of security best practices including KMS encryption and WAF configuration.
  • Proven capability in building observability platforms and managing incident response workflows.
  • Deep expertise in Infrastructure as Code (IaC) tools and container orchestration.
  • Strong automation skills and a drive to eliminate manual runbooks.
  • Demonstrated ability to own projects from proposal to production.
  • Capacity to make high-judgment decisions regarding reliability investments and trade-offs.
  • Experience with GCP, Kubernetes, or similar cloud-native environments.
  • Familiarity with multi-tenant isolation or sandboxed execution environments.

Responsibilities

  • Own platform reliability, infrastructure, observability, and incident response.
  • Manage and scale sandbox infrastructure for AI agents.
  • Design and configure monitoring, alerting, and observability pipelines.
  • Lead structured incident responses and drive permanent root-cause fixes.
  • Automate infrastructure using IaC and container orchestration.
  • Inherit and maintain existing infrastructure (Terraform, OpenTelemetry, Prometheus/Grafana).
  • Collaborate with AI Platform Engineers to secure and isolate tenant workloads.