Senior Site Reliability Engineer - Platform & Agentic Operations

You work remotely (Germany-wide) | Full-Time | Senior

Salary not disclosed

Job Details

Languages: English (Very good knowledge of spoken and written); German is a plus
Experience: 6+ years
Required Skills: Python, GCP, TypeScript, Go, CI/CD, DevOps, Terraform

Requirements

6+ years of experience in SRE, DevOps, or Platform Engineering.
Strong understanding and practical application of Site Reliability Engineering (SRE) principles.
Proficiency in at least one programming or scripting language such as Python, Go, or TypeScript.
Practical experience integrating LLMs into automated workflows and providing live system state as context to agents.
Prior experience in incident management, post-incident reviews, and implementing preventive improvements.
Ability to troubleshoot complex technical issues systematically.
Solid experience with a public cloud provider, ideally Google Cloud Platform (GCP).
Understanding of GCP observability services.
Proactive approach to identifying performance bottlenecks.
Excellent communication skills.
Very good knowledge of spoken and written English.
Residency in Germany.

Responsibilities

Implement and improve monitoring, alerting, and incident response systems to ensure high reliability and meet defined SLOs.
Design, build, and maintain resilient, scalable infrastructure utilizing SRE principles and best practices.
Attend post-incident reviews, detect patterns, and contribute to continuous improvement efforts.
Execute performance testing, analyze system bottlenecks, and formulate strategies for capacity planning.
Build systems where CI/CD test failures serve as real-time context for AI agents to analyze logs and suggest or apply code fixes.