Site Reliability Engineer
Brazil (Remote) / Argentina (Remote) / Colombia (Remote) / Ecuador (Remote) / Mexico (Remote) / Paraguay (Remote) / Peru (Remote)
Contract
Middle
Salary not disclosed
Job Details
- Languages: English (C1 or C2)
- Required Skills: Python, Bash, Kubernetes, Azure, Grafana, Prometheus, Terraform, Datadog
Requirements
- Must be based in Latin America
- English level: C1 or C2
- Proven experience as a Site Reliability Engineer or similar role
- Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry)
- Experience with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform)
- Strong programming and scripting skills (Python, Bash)
- Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes)
- Understanding of Linux-based systems, networking, and security principles related to containerized applications
- Strong problem-solving and troubleshooting skills
- Excellent communication and collaboration abilities
Responsibilities
- Design, implement, and maintain monitoring and observability solutions using tools like Prometheus, the Grafana stack (Loki/Grafana/Tempo/Alertmanager), Datadog, and OpenTelemetry.
- Define and implement SLOs, SLIs, and error budgets to measure system reliability.
- Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
- Design actionable alerting strategies to minimize noise and improve MTTR.
- Integrate alerting systems with Jira.
- Establish and refine runbooks for on-call teams to handle alerts efficiently.
- Analyze system performance metrics and implement optimizations for scalability.
- Develop tools to streamline operational processes such as fail-over and configuration management.