Site Reliability Engineer Lead

Brazil, Full-Time, Lead
Salary not disclosed

Job Details

Languages
English
Required Skills
Python, Kubernetes, Go, Grafana, Prometheus, DevOps, Terraform

Requirements

  • Proven experience leading technical teams such as SRE, DevOps, or Cloud Engineering.
  • Strong hands-on experience with SRE principles including SLIs, SLOs, error budgets, and toil reduction.
  • Experience with observability and APM tools such as Datadog, New Relic, or Dynatrace.
  • Solid knowledge of telemetry systems (metrics, logs, traces) using Prometheus, OpenTelemetry, and the Grafana ecosystem.
  • Experience with Infrastructure as Code tools such as Terraform or AWS CDK.
  • Strong scripting and programming skills in Python, Bash, and at least one additional language such as Go or Java.
  • Experience with logging and tracing solutions at scale such as Loki, Tempo, Jaeger, or ELK Stack.
  • Strong cloud experience, preferably in AWS environments.
  • Experience with containers and orchestration technologies such as Docker, Kubernetes, or ECS.
  • Solid understanding of incident management and post-mortem processes.
  • Strong Linux systems knowledge and troubleshooting skills.
  • English proficiency for technical reading and writing.

Responsibilities

  • Lead, mentor, and develop a high-performing SRE team, fostering collaboration, technical excellence, and continuous learning.
  • Define the SRE strategy, roadmap, and priorities aligned with cloud and business objectives.
  • Establish and evolve observability standards, including metrics, logs, and traces across systems and applications.
  • Drive adoption and governance of SLIs, SLOs, and error budgets for critical services.
  • Oversee the evolution of observability platforms using tools such as Prometheus, Grafana, OpenTelemetry, Loki, and Tempo.
  • Design and implement actionable alerting strategies to reduce noise and improve incident response efficiency.
  • Lead incident management processes, including escalation, war rooms, communication, and post-mortem reviews.
  • Ensure blameless post-incident analysis and drive systemic improvements based on recurring issues and data insights.
  • Promote automation initiatives to reduce operational toil and improve engineering efficiency.