Site Reliability Engineer

Flip App
AI employee experience platform
Elsewhere in Europe · Full-Time · Mid-level
Salary not disclosed

Job Details

Languages
English
Experience
1–3 years
Required Skills
AWS, Python, GCP, Kotlin, Kubernetes, Azure, Go, Prometheus, CI/CD, Terraform, Ansible

Requirements

  • 1–3 years of hands-on experience as a Site Reliability Engineer (SRE), Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus
  • Experience operating and scaling cloud infrastructure (Azure, GCP, AWS)
  • Deep knowledge of Kubernetes and container orchestration in production environments
  • Hands-on experience with modern observability stacks (e.g., Prometheus, Mimir, Loki, ELK) and familiarity with defining and operating SLOs and error budgets
  • Solid software development skills in Go (preferred, since our IaC runs on Pulumi in Go), Python, or Kotlin
  • Hands-on experience with Infrastructure as Code (e.g., Pulumi, OpenTofu, Terraform) and configuration management tools (e.g., Ansible, Chef)
  • A collaborative mindset, strong communication skills, and business-fluent English
  • Willingness to participate in on-call rotations to ensure the reliability of our platform

Responsibilities

  • Enable scaling: Expand and optimize our cloud infrastructure on Azure and our Kubernetes clusters – designed for high throughput and maximum availability – to support Flip's rapid global growth.
  • Ensure resilience & security: Design and implement zero-downtime deployments, rollback mechanisms, and disaster recovery strategies that keep our platform available around the clock.
  • Create observability: Further develop our LGTM stack (Loki, Grafana, Tempo, Mimir) to provide every team with the necessary visibility – and use it to define and optimize our SLOs.
  • Automate everything: Design, develop, and optimize Infrastructure as Code with Pulumi in Go to eliminate manual effort (toil) and provide our platform as a self-service offering for engineering teams.
  • Drive reliability practices: Promote CI/CD best practices, incident management, post-mortems, and developer experience across the entire engineering organization.
  • Shape our roadmap: Work with your squad and engineering leadership to define the direction of the platform – from scalable high-throughput systems and cost optimization to security posture and compliance.