Site Reliability Engineer
Duvo Inc, AI Operations Platform
EU/UK Based, Full-Time, Middle
Salary: 110,000 - 220,000 EUR per year
Job Details
Required Skills
- Docker, GCP, Kubernetes, Grafana, Prometheus, Terraform, Distributed Systems
Requirements
- Extensive experience designing and operating large-scale distributed systems.
- Solid understanding of security best practices, including KMS encryption and WAF configuration.
- Proven capability in building observability platforms and managing incident response workflows.
- Deep expertise in Infrastructure as Code (IaC) tools and container orchestration.
- Strong automation skills and a drive to eliminate manual runbooks.
- Demonstrated ability to own projects from proposal to production.
- Capacity to make sound judgment calls on reliability investments and trade-offs.
- Experience with GCP, Kubernetes, or similar cloud-native environments.
- Familiarity with multi-tenant isolation or sandboxed execution environments.
Responsibilities
- Own platform reliability, infrastructure, observability, and incident response.
- Manage and scale sandbox infrastructure for AI agents.
- Design and configure monitoring, alerting, and observability pipelines.
- Lead structured incident response and drive permanent root-cause fixes.
- Automate infrastructure using IaC and container orchestration.
- Inherit and maintain existing infrastructure (Terraform, OpenTelemetry, Prometheus/Grafana).
- Collaborate with AI Platform Engineers to secure and isolate tenant workloads.