Site Reliability Engineer

Brazil (Remote) / Argentina (Remote) / Colombia (Remote) / Ecuador (Remote) / Mexico (Remote) / Paraguay (Remote) / Peru (Remote)
Contract
Middle
Salary not disclosed

Job Details

Languages
English (C1 or C2)
Required Skills
Python, Bash, Kubernetes, Azure, Grafana, Prometheus, Terraform, Datadog

Requirements

  • Must be based in Latin America
  • English level - C1 or C2
  • Proven experience as a Site Reliability Engineer or similar role
  • Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry)
  • Experience with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform)
  • Strong programming and scripting skills (Python, Bash)
  • Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes)
  • Understanding of Linux-based systems, networking, and security principles related to containerized applications
  • Strong problem-solving and troubleshooting skills
  • Excellent communication and collaboration abilities

Responsibilities

  • Design, implement, and maintain monitoring and observability solutions using tools like Prometheus, the Grafana stack (Loki, Grafana, Tempo, Alertmanager), Datadog, and OpenTelemetry.
  • Define and implement SLOs, SLIs, and error budgets to measure system reliability.
  • Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
  • Design actionable alerting strategies to minimize noise and reduce MTTR.
  • Integrate alerting systems with Jira.
  • Establish and refine runbooks for on-call teams to handle alerts efficiently.
  • Analyze system performance metrics and implement optimizations for scalability.
  • Develop tools to streamline operational processes such as failover and configuration management.