Senior Site Reliability Engineer (SRE)
Based in Brazil | Full-Time | Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: AWS, Python, TypeScript, Go
Requirements
- 5+ years of experience in Site Reliability Engineering, Production Engineering, or similar roles supporting high-availability systems.
- Strong hands-on experience defining and managing SLOs, SLIs, and error budgets in production environments.
- Proven experience leading incident response and acting as Incident Commander during critical production outages.
- Deep expertise in observability tools and practices, including monitoring, logging, alerting, and distributed tracing.
- Strong software engineering skills in Python, Go, or TypeScript, with a focus on automation and reliability engineering.
- Experience working with cloud environments (AWS or similar) and supporting mission-critical systems at scale.
- Demonstrated ability to improve on-call processes, reduce alert noise, and build effective operational frameworks.
- Experience conducting blameless postmortems and driving long-term reliability improvements.
Responsibilities
- Define, implement, and continuously improve SLIs, SLOs, and error budgets to measure and enhance system reliability across production environments.
- Own and evolve observability practices, including monitoring, logging, tracing, and alerting strategies to ensure full system visibility.
- Lead incident response efforts as Incident Commander during production outages, coordinating resolution across engineering teams.
- Design, maintain, and optimize on-call systems, including escalation policies, runbooks, alert tuning, and operational workflows.
- Drive blameless postmortems and ensure follow-through on corrective actions to prevent recurrence of production issues.
- Collaborate with engineering teams on production readiness, capacity planning, scalability, and disaster recovery initiatives.
- Automate operational tasks and reliability processes using software engineering practices to improve system efficiency and resilience.