Site Reliability Engineer Specialist

Brazil, Full-Time, Senior
Salary not disclosed

Job Details

Languages
English and Portuguese
Experience
8+ years
Required Skills
Node.js, PostgreSQL, Java, Kubernetes, MongoDB, Prometheus

Requirements

  • 8+ years of experience in SRE, infrastructure, or platform engineering in large-scale production environments.
  • Strong hands-on experience with Kubernetes (preferably GKE), including debugging production workloads.
  • Deep expertise in observability systems (OpenTelemetry, Prometheus, Elasticsearch, Logstash, Fluent Bit).
  • Experience defining and operationalizing SLIs, SLOs, and error budgets.
  • Strong background in leading high-severity incidents and postmortem processes.
  • Experience operating distributed stateful systems such as PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (S3/MinIO).
  • Production experience with Java services (JVM tuning, performance troubleshooting) and Node.js.
  • Proven ability to influence engineering teams and mentor senior engineers.
  • Fluency in English and Portuguese for cross-functional collaboration.

Responsibilities

  • Define and own the technical strategy for observability across the platform, including metrics, logs, and distributed tracing.
  • Establish and evolve SLIs, SLOs, and error budgets to drive engineering and product decision-making.
  • Lead major incident response efforts as incident commander and conduct blameless postmortems.
  • Improve on-call practices by reducing alert noise and minimizing toil.
  • Influence and support architectural decisions across distributed systems like GKE, Kong, RabbitMQ, PostgreSQL, and MongoDB Atlas.
  • Mentor SRE and platform engineers to raise overall reliability maturity.
  • Drive adoption of observability and reliability best practices across Java and Node.js production services.