Site Reliability Engineer Specialist

Brazil | Full-Time | Senior

Salary not disclosed

Job Details

Requirements

8+ years of experience in SRE, infrastructure, or platform engineering in large-scale production environments.
Strong hands-on experience with Kubernetes (preferably GKE), including debugging production workloads.
Deep expertise in observability systems (OpenTelemetry, Prometheus, Elasticsearch, Logstash, Fluent Bit).
Experience defining and operationalizing SLIs, SLOs, and error budgets.
Strong background in leading high-severity incidents and postmortem processes.
Experience operating distributed stateful systems such as PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (S3/MinIO).
Production experience with Java services (JVM tuning, performance troubleshooting) and Node.js.
Proven ability to influence engineering teams and mentor senior engineers.
Fluency in English and Portuguese for cross-functional collaboration.

Responsibilities

Define and own the technical strategy for observability across the platform, including metrics, logs, and distributed tracing.
Establish and evolve SLIs, SLOs, and error budgets to drive engineering and product decision-making.
Lead major incident response efforts as incident commander and conduct blameless postmortems.
Improve on-call practices by reducing alert noise and minimizing toil.
Influence and support architectural decisions across distributed systems such as GKE, Kong, RabbitMQ, PostgreSQL, and MongoDB Atlas.
Mentor SRE and platform engineers to raise overall reliability maturity.
Drive adoption of observability and reliability best practices across Java and Node.js production services.
