Site Reliability Engineer
Flip App
AI employee experience platform
Elsewhere in Europe, Full-Time, Mid-level
Salary not disclosed
Job Details
- Languages: English
- Experience: 1–3 years
- Required Skills: AWS, Python, GCP, Kotlin, Kubernetes, Azure, Go, Prometheus, CI/CD, Terraform, Ansible
Requirements
- 1–3 years of hands-on experience as a Site Reliability Engineer (SRE), Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus
- Experience operating and scaling cloud infrastructure (Azure, GCP, AWS)
- Deep knowledge of Kubernetes and container orchestration in production environments
- Hands-on experience with modern observability stacks (e.g., Prometheus, Mimir, Loki, ELK) and familiarity with defining and operating SLOs and error budgets
- Solid software development skills in Go (preferred, since our IaC runs on Pulumi in Go), Python, or Kotlin
- Hands-on experience with Infrastructure as Code (e.g., Pulumi, OpenTofu, Terraform) and configuration management tools (e.g., Ansible, Chef)
- A collaborative mindset, strong communication skills, and business-fluent English
- Willingness to participate in on-call rotations to ensure the reliability of our platform
Responsibilities
- Enable scaling: Expand and optimize our cloud infrastructure on Azure and our Kubernetes clusters, which are designed for high throughput and maximum availability, to support Flip's rapid global growth.
- Ensure resilience & security: Design and implement zero-downtime deployments, rollback mechanisms, and disaster recovery strategies that keep our platform available around the clock.
- Create observability: Further develop our LGTM stack (Loki, Grafana, Tempo, Mimir) to provide every team with the necessary visibility – and use it to define and optimize our SLOs.
- Automate everything: Design, develop, and optimize Infrastructure as Code with Pulumi in Go to eliminate manual effort (toil) and offer our platform as a self-service for engineering teams.
- Drive reliability practices: Promote CI/CD best practices, incident management, post-mortems, and developer experience across the entire engineering organization.
- Shape our roadmap: Work with your squad and engineering leadership to define the direction of the platform – from scalable high-throughput systems and cost optimization to security posture and compliance.