Senior Site Reliability Engineer - Platform & Agentic Operations

Remote (Germany-wide), Full-Time, Senior
Salary not disclosed

Job Details

Languages
English (very good spoken and written); German is a plus
Experience
6+ years
Required Skills
Python, GCP, TypeScript, Go, CI/CD, DevOps, Terraform

Requirements

  • 6+ years of experience in SRE, DevOps, or Platform Engineering.
  • Strong understanding and practical application of Site Reliability Engineering (SRE) principles.
  • Proficiency in programming/scripting languages such as Python, Go, or TypeScript.
  • Practical experience integrating LLMs into automated workflows and providing live system state as context to agents.
  • Prior experience in incident management, post-incident reviews, and implementing preventive improvements.
  • Ability to troubleshoot complex technical issues systematically.
  • Solid experience with a public cloud provider, ideally Google Cloud Platform (GCP).
  • Understanding of GCP observability services.
  • Proactive approach to identifying performance bottlenecks.
  • Excellent communication skills.
  • Very good knowledge of spoken and written English.
  • Residency in Germany.

Responsibilities

  • Implement and improve monitoring, alerting, and incident response systems to ensure high reliability and meet defined SLOs.
  • Design, build, and maintain resilient, scalable infrastructure utilizing SRE principles and best practices.
  • Attend post-incident reviews, detect patterns, and contribute to continuous improvement efforts.
  • Execute performance testing, analyze system bottlenecks, and formulate strategies for capacity planning.
  • Build systems where CI/CD test failures serve as real-time context for AI agents to analyze logs and suggest or apply code fixes.