Staff Site Reliability Engineer
Fully remote work environment across Europe. Listing location: Germany.
Full-Time, Staff
Salary not disclosed
Job Details
- Experience: 8–10 years
- Required Skills: Docker, Python, Kubernetes, Go, Grafana, Prometheus, Terraform, Datadog
Requirements
- 8–10 years of experience in SRE, DevOps, or Infrastructure Engineering.
- Strong software engineering skills with Python or Go.
- Deep expertise in distributed systems architecture.
- Extensive experience with Kubernetes, Docker, and container orchestration.
- Experience designing observability ecosystems using Prometheus, Grafana, Datadog, or OpenTelemetry.
- Strong background in incident management and root cause analysis.
- Hands-on experience with Infrastructure as Code tools like Terraform or Pulumi.
- Excellent communication skills.
- Proven leadership and mentoring experience.
Responsibilities
- Design and implement comprehensive observability solutions.
- Define, track, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Lead high-severity incident response efforts and conduct blameless post-mortems.
- Build and maintain infrastructure automation and Infrastructure as Code using Terraform or Pulumi.
- Develop self-healing systems to reduce operational overhead.
- Optimize large-scale Kubernetes and cloud-native deployments.
- Investigate and resolve complex distributed systems issues.
- Review architectural designs for reliability and scalability.
- Mentor engineers and establish reliability-focused engineering standards.
- Build internal tools and automation using Python or Go.