Staff Site Reliability Engineer

Fully remote work environment across Europe. Listing location: Germany.
Full-Time
Staff
Salary not disclosed

Job Details

Experience
8–10 years
Required Skills
Docker, Python, Kubernetes, Go, Grafana, Prometheus, Terraform, Datadog

Requirements

  • 8–10 years of experience in SRE, DevOps, or Infrastructure Engineering.
  • Strong software engineering skills with Python or Go.
  • Deep expertise in distributed systems architecture.
  • Extensive experience with Kubernetes, Docker, and container orchestration.
  • Experience designing observability ecosystems using Prometheus, Grafana, Datadog, or OpenTelemetry.
  • Strong background in incident management and root cause analysis.
  • Hands-on experience with Infrastructure as Code tools like Terraform or Pulumi.
  • Excellent communication skills.
  • Proven leadership and mentoring experience.

Responsibilities

  • Design and implement comprehensive observability solutions.
  • Define, track, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
  • Lead high-severity incident response efforts and conduct blameless post-mortems.
  • Build and maintain infrastructure automation and Infrastructure as Code using Terraform or Pulumi.
  • Develop self-healing systems to reduce operational overhead.
  • Optimize large-scale Kubernetes and cloud-native deployments.
  • Investigate and resolve complex distributed systems issues.
  • Review architectural designs for reliability and scalability.
  • Mentor engineers and establish reliability-focused engineering standards.
  • Build internal tools and automation using Python or Go.