Site Reliability Engineer Lead

Brazil, Full-Time, Lead

Salary not disclosed

Job Details

Languages: English
Required Skills: Python, Kubernetes, Go, Grafana, Prometheus, DevOps, Terraform

Requirements

Proven experience leading technical teams such as SRE, DevOps, or Cloud Engineering.
Strong hands-on experience with SRE principles including SLIs, SLOs, error budgets, and toil reduction.
Experience with observability and APM tools such as Datadog, New Relic, or Dynatrace.
Solid knowledge of telemetry systems (metrics, logs, traces) using Prometheus, OpenTelemetry, and the Grafana ecosystem.
Experience with Infrastructure as Code tools such as Terraform or AWS CDK.
Strong scripting and programming skills in Python and Bash, plus at least one additional language such as Go or Java.
Experience with logging and tracing solutions at scale such as Loki, Tempo, Jaeger, or ELK Stack.
Strong cloud experience, preferably in AWS environments.
Experience with containers and orchestration technologies such as Docker, Kubernetes, or ECS.
Solid understanding of incident management and post-mortem processes.
Strong Linux systems knowledge and troubleshooting skills.
English proficiency for technical reading and writing.

Responsibilities

Lead, mentor, and develop a high-performing SRE team, fostering collaboration, technical excellence, and continuous learning.
Define the SRE strategy, roadmap, and priorities aligned with cloud and business objectives.
Establish and evolve observability standards, including metrics, logs, and traces across systems and applications.
Drive adoption and governance of SLIs, SLOs, and error budgets for critical services.
Oversee the evolution of observability platforms using tools such as Prometheus, Grafana, OpenTelemetry, Loki, and Tempo.
Design and implement actionable alerting strategies to reduce noise and improve incident response efficiency.
Lead incident management processes, including escalation, war rooms, communication, and post-mortem reviews.
Ensure blameless post-incident analysis and drive systemic improvements based on recurring issues and data insights.
Promote automation initiatives to reduce operational toil and improve engineering efficiency.
