Senior Site Reliability Engineer (SRE)

Based in Brazil, Full-Time, Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
AWS, Python, TypeScript, Go

Requirements

  • 5+ years of experience in Site Reliability Engineering, Production Engineering, or similar roles supporting high-availability systems.
  • Strong hands-on experience defining and managing SLOs, SLIs, and error budgets in production environments.
  • Proven experience leading incident response and acting as Incident Commander during critical production outages.
  • Deep expertise in observability tools and practices, including monitoring, logging, alerting, and distributed tracing.
  • Strong software engineering skills in Python, Go, or TypeScript, with a focus on automation and reliability engineering.
  • Experience working with cloud environments (AWS or similar) and supporting mission-critical systems at scale.
  • Demonstrated ability to improve on-call processes, reduce alert noise, and build effective operational frameworks.
  • Experience conducting blameless postmortems and driving long-term reliability improvements.

Responsibilities

  • Define, implement, and continuously improve SLIs, SLOs, and error budgets to measure and enhance system reliability across production environments.
  • Own and evolve observability practices, including monitoring, logging, tracing, and alerting strategies to ensure full system visibility.
  • Lead incident response efforts as Incident Commander during production outages, coordinating resolution across engineering teams.
  • Design, maintain, and optimize on-call systems, including escalation policies, runbooks, alert tuning, and operational workflows.
  • Drive blameless postmortems and ensure follow-through on corrective actions to prevent recurrence of production issues.
  • Collaborate with engineering teams on production readiness, capacity planning, scalability, and disaster recovery initiatives.
  • Automate operational tasks and reliability processes using software engineering practices to improve system efficiency and resilience.