Sr. Site Reliability Engineer - SRE

Spain, Full-Time, Senior
Salary not disclosed

Job Details

Required Skills
AWS, Python, Bash, Kubernetes, Go, CI/CD, Terraform, Datadog

Requirements

  • Demonstrated experience operating and improving production systems at scale.
  • Strong troubleshooting skills with a methodical approach to incident response.
  • Experience defining and using SLIs, SLOs, and error budgets.
  • Proficiency in AWS cloud infrastructure and services.
  • Experience with Kubernetes platforms, specifically Amazon EKS.
  • Knowledge of identity and access management systems such as Auth0 and AWS IAM.
  • Familiarity with networking fundamentals like DNS, load balancing, and TLS.
  • Experience with GitOps workflows and infrastructure automation using Terraform and Flux.
  • Demonstrated ability to build automation and tooling using Python, Go, or Bash.
  • Excellent written and verbal communication skills.

Responsibilities

  • Design, implement, and maintain highly available, scalable, and resilient systems.
  • Define and enforce best practices for monitoring, alerting, and logging within Datadog.
  • Develop robust software and tooling to automate operational tasks and reduce toil.
  • Participate in on-call rotations, respond to incidents, and lead blameless post-mortems.
  • Collaborate with engineering teams to define and track SLIs, SLOs, and error budgets.
  • Contribute to infrastructure as code efforts using Terraform and GitHub Actions.
  • Provide SRE expertise in system design reviews and architecture.