Senior Site Reliability Engineer - Platform & Agentic Operations
Remote (Germany-wide) | Full-Time | Senior
Salary not disclosed
Job Details
- Languages: English (very good spoken and written); German is a plus
- Experience: 6+ years
- Required Skills: Python, GCP, TypeScript, Go, CI/CD, DevOps, Terraform
Requirements
- 6+ years of experience in SRE, DevOps, or Platform Engineering.
- Strong understanding and practical application of Site Reliability Engineering (SRE) principles.
- Proficiency in programming/scripting languages such as Python, Go, or TypeScript.
- Practical experience integrating LLMs into automated workflows and providing live system state as context to agents.
- Prior experience in incident management, post-incident reviews, and implementing preventive improvements.
- Ability to troubleshoot complex technical issues systematically.
- Solid experience with a public cloud provider, ideally Google Cloud Platform (GCP).
- Understanding of GCP observability services.
- Proactive approach to identifying performance bottlenecks.
- Excellent communication skills.
- Very good knowledge of spoken and written English.
- Residency in Germany.
Responsibilities
- Implement and improve monitoring, alerting, and incident response systems to ensure high reliability and meet defined SLOs.
- Design, build, and maintain resilient, scalable infrastructure utilizing SRE principles and best practices.
- Attend post-incident reviews, detect patterns, and contribute to continuous improvement efforts.
- Execute performance testing, analyze system bottlenecks, and formulate strategies for capacity planning.
- Build systems where CI/CD test failures serve as real-time context for AI agents to analyze logs and suggest or apply code fixes.