Site Reliability Engineer Specialist
Brazil, Full-Time, Senior
Salary not disclosed
Job Details
- Languages: English and Portuguese
- Experience: 8+ years
- Required Skills: Node.js, PostgreSQL, Java, Kubernetes, MongoDB, Prometheus
Requirements
- 8+ years of experience in SRE, infrastructure, or platform engineering in large-scale production environments.
- Strong hands-on experience with Kubernetes (preferably GKE), including debugging production workloads.
- Deep expertise in observability systems (OpenTelemetry, Prometheus, Elasticsearch, Logstash, Fluent Bit).
- Experience defining and operationalizing SLIs, SLOs, and error budgets.
- Strong background in leading high-severity incidents and postmortem processes.
- Experience operating distributed stateful systems such as PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (S3/MinIO).
- Production experience with Java services (JVM tuning, performance troubleshooting) and Node.js.
- Proven ability to influence engineering teams and mentor senior engineers.
- Fluency in English and Portuguese for cross-functional collaboration.
Responsibilities
- Define and own the technical strategy for observability across the platform, including metrics, logs, and distributed tracing.
- Establish and evolve SLIs, SLOs, and error budgets to drive engineering and product decision-making.
- Lead major incident response efforts as incident commander and conduct blameless postmortems.
- Improve on-call practices by reducing alert noise and minimizing toil.
- Influence and support architectural decisions across distributed systems such as GKE, Kong, RabbitMQ, PostgreSQL, and MongoDB Atlas.
- Mentor SRE and platform engineers to raise overall reliability maturity.
- Drive adoption of observability and reliability best practices across Java and Node.js production services.