Site Reliability Engineer - Observability

Posted about 2 months ago

💎 Seniority level: Senior

📍 Location: Finland, Sweden, Germany, Denmark, Estonia

🔍 Industry: Technology and Delivery Services

🗣️ Languages: English

⏳ Experience: Proven experience in Software Engineering, SRE, or a similar role

🪄 Skills: AWS, Docker, Python, ElasticSearch, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Terraform, Ansible

Requirements:
  • Proven experience in Software Engineering, SRE, or a similar role with a focus on observability and reliability.
  • Strong foundation in computer science principles and engineering fundamentals.
  • Proficient in Go or Python, with experience building automation tools and software.
  • Hands-on experience with observability tooling such as DataDog, Prometheus, Mimir, Elasticsearch, Grafana, and Jaeger.
  • Expertise in cloud platforms like AWS, GCP, or Azure, managing infrastructure using Kubernetes and Docker.
  • Deep knowledge of building and maintaining reliable, high-performance, and scalable distributed systems.
  • Solid understanding of SRE principles, incident response, and designing fault-tolerant architectures.
Responsibilities:
  • Be responsible for building and improving our observability platform and tooling, used by all Wolt engineers.
  • Contribute to initiatives focused on architecting, building, and maintaining our observability stack.
  • Champion observability best practices, guiding and supporting other Woltians.
  • Take ownership of key initiatives to improve the quality, efficiency, and reliability of our observability stack.
  • Participate in the on-call rotation to address incidents and outages.
  • Help standardize observability resources by building tools and documentation.
Related Jobs

📍 Finland, Sweden, Germany, Denmark, Estonia

🧭 Full-Time

Requirements:
  • Proven experience in Software Engineering, SRE, or a similar role with a focus on observability, reliability, and scaling large systems.
  • Experience with OpenTelemetry, which is a key foundation for much of the infrastructure and tooling the team is converging on.
  • Strong foundation in computer science principles and engineering fundamentals.
  • Proficient in development, particularly in Go (preferred) or Python, with experience building automation tools and software for large-scale, distributed systems.
  • Hands-on experience with observability tooling such as DataDog, Prometheus, Mimir, Elasticsearch, Grafana, Jaeger, and tracing systems.
  • Expertise in cloud platforms like AWS, GCP, or Azure, with experience managing cloud infrastructure using Kubernetes and containers (Docker).
  • Deep knowledge of building and maintaining reliable, high-performance, and scalable distributed systems.
  • Solid understanding of SRE principles, incident response, and designing fault-tolerant architectures.
  • Experience with infrastructure-as-code tools like Terraform or Ansible for managing cloud environments.
  • Familiarity with CI/CD pipelines, automated testing, and continuous delivery practices.
  • Strong analytical and problem-solving skills, with experience troubleshooting complex distributed systems.
  • Excellent communication and collaboration skills, with the ability to work cross-functionally to enhance platform observability and reliability.
  • Experience working directly with development teams, with a willingness to dive into application code for observability-related topics.
  • Solid experience with Docker and Kubernetes, coupled with a strong foundation in Unix systems and networking concepts.
Responsibilities:
  • Be responsible for building and improving our observability platform and tooling, used by all Wolt engineers.
  • Contribute to initiatives focused on architecting, building, and maintaining our observability stack to efficiently handle increasing telemetry data with greater reliability.
  • Champion observability best practices, guiding and supporting other Woltians in this space.
  • Take ownership of key initiatives to improve the quality, efficiency, and reliability of our observability stack.
  • Apply expertise in SRE culture and practices to ensure observability has a meaningful impact on the business.
  • Participate in the on-call rotation to address incidents and outages, resolving reliability issues efficiently.
  • Help standardize observability resources by building tools and documentation that enhance productivity and developer experience.
  • Triage and resolve production issues within the observability scope.
  • Contribute to open-source efforts by sharing some of our internal tools with the broader community.

AWS, Docker, Python, ElasticSearch, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Terraform, Ansible

Posted about 1 month ago