Platform Engineer (Site Reliability Engineering)

Brazil | Full-Time | Middle
Salary not disclosed

Job Details

Languages
English
Required Skills
Python, Java, Kubernetes, CI/CD, DevOps

Requirements

  • Proven experience in Site Reliability Engineering, Platform Engineering, DevOps, or similar infrastructure-focused roles.
  • Hands-on experience with Kubernetes, including deployment, debugging, and production troubleshooting.
  • Strong understanding of CI/CD pipelines and modern DevOps practices.
  • Software development experience in any modern language (Python or Java strongly preferred).
  • Strong automation mindset with a focus on reducing repetitive operational work through tooling.
  • Experience with observability tools, monitoring systems, and alerting frameworks.
  • Familiarity with AI/LLM-based workflows or agentic automation is highly desirable.
  • Ability to manage high-severity incidents and communicate clearly with technical and non-technical stakeholders.
  • Strong written and verbal communication skills in English.
  • Self-driven, proactive mindset with the ability to operate independently in ambiguous situations.

Responsibilities

  • Own and drive end-to-end incident management processes, ensuring rapid response, clear communication, and effective resolution during production incidents.
  • Lead on-call operations, including incident triage, escalation, coordination, and stakeholder communication across severity levels.
  • Design and implement automation to improve postmortem workflows, including tracking action items, ownership, and remediation follow-ups.
  • Build tooling and AI-assisted workflows to reduce operational toil and accelerate incident detection, response, and resolution.
  • Improve observability across distributed systems, including dashboards, alerting strategies, and monitoring.
  • Conduct post-incident analysis to identify root causes and implement long-term reliability improvements.
  • Collaborate with engineering teams to define preventive measures, improve runbooks, and reduce recurring incidents.
  • Support change and deployment processes with a focus on risk mitigation and system stability.