Site Reliability Engineer Lead
Brazil | Full-Time | Lead
Salary not disclosed
Job Details
- Languages: English
- Required Skills: Python, Kubernetes, Go, Grafana, Prometheus, DevOps, Terraform
Requirements
- Proven experience leading technical teams such as SRE, DevOps, or Cloud Engineering.
- Strong hands-on experience with SRE principles including SLIs, SLOs, error budgets, and toil reduction.
- Experience with observability and APM tools such as Datadog, New Relic, or Dynatrace.
- Solid knowledge of telemetry systems (metrics, logs, traces) using Prometheus, OpenTelemetry, and the Grafana ecosystem.
- Experience with Infrastructure as Code tools such as Terraform or AWS CDK.
- Strong scripting and programming skills in Python and Bash, plus at least one additional language such as Go or Java.
- Experience with logging and tracing solutions at scale such as Loki, Tempo, Jaeger, or ELK Stack.
- Strong cloud experience, preferably in AWS environments.
- Experience with containers and orchestration technologies such as Docker, Kubernetes, or ECS.
- Solid understanding of incident management and post-mortem processes.
- Strong Linux systems knowledge and troubleshooting skills.
- English proficiency for technical reading and writing.
Responsibilities
- Lead, mentor, and develop a high-performing SRE team, fostering collaboration, technical excellence, and continuous learning.
- Define the SRE strategy, roadmap, and priorities aligned with cloud and business objectives.
- Establish and evolve observability standards, including metrics, logs, and traces across systems and applications.
- Drive adoption and governance of SLIs, SLOs, and error budgets for critical services.
- Oversee the evolution of observability platforms using tools such as Prometheus, Grafana, OpenTelemetry, Loki, and Tempo.
- Design and implement actionable alerting strategies to reduce noise and improve incident response efficiency.
- Lead incident management processes, including escalation, war rooms, communication, and post-mortem reviews.
- Ensure blameless post-incident analysis and drive systemic improvements based on recurring issues and data insights.
- Promote automation initiatives to reduce operational toil and improve engineering efficiency.