Platform Engineer (Site Reliability Engineering)
Brazil, Full-Time, Middle
Salary not disclosed
Job Details
- Languages: English
- Required Skills: Python, Java, Kubernetes, CI/CD, DevOps
Requirements
- Proven experience in Site Reliability Engineering, Platform Engineering, DevOps, or similar infrastructure-focused roles.
- Hands-on experience with Kubernetes, including deployment, debugging, and production troubleshooting.
- Strong understanding of CI/CD pipelines and modern DevOps practices.
- Software development experience in any modern language (Python or Java strongly preferred).
- Strong automation mindset with a focus on reducing repetitive operational work through tooling.
- Experience with observability tools, monitoring systems, and alerting frameworks.
- Familiarity with AI/LLM-based workflows or agentic automation is highly desirable.
- Ability to manage high-severity incidents and communicate clearly with technical and non-technical stakeholders.
- Strong written and verbal communication skills in English.
- Self-driven, proactive mindset with the ability to operate independently in ambiguous situations.
Responsibilities
- Own and drive end-to-end incident management processes, ensuring rapid response, clear communication, and effective resolution during production incidents.
- Lead on-call operations, including incident triage, escalation, coordination, and stakeholder communication across severity levels.
- Design and implement automation to improve postmortem workflows, including tracking action items, ownership, and remediation follow-ups.
- Build tooling and AI-assisted workflows to reduce operational toil and accelerate incident detection, response, and resolution.
- Improve observability across distributed systems, including dashboards, alerting strategies, and monitoring.
- Conduct post-incident analysis to identify root causes and implement long-term reliability improvements.
- Collaborate with engineering teams to define preventive measures, improve runbooks, and reduce recurring incidents.
- Support change and deployment processes with a focus on risk mitigation and system stability.