Site Reliability Engineer

India, Full-Time
Salary not disclosed

Job Details

Required Skills
AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Go, CI/CD, Terraform

Requirements

  • Proven experience as a Site Reliability Engineer, Platform Engineer, DevOps Engineer, or in a similar cloud infrastructure role.
  • Strong scripting and programming skills using Python, Go, Bash, or comparable languages.
  • Hands-on experience with Kubernetes, Docker, and cloud platforms (AWS, Azure, or GCP).
  • Experience with Infrastructure as Code solutions including Terraform, Pulumi, or Crossplane.
  • Solid knowledge of CI/CD platforms such as GitHub Actions, Jenkins, or TeamCity.
  • Experience with monitoring and observability technologies including Grafana, Prometheus, ELK, Tempo, or Loki.
  • Understanding of Internal Developer Platforms (IDPs), developer experience (DevEx), and platform engineering principles.
  • Familiarity with cloud governance, security best practices, incident response, and ISO 27001 or similar compliance frameworks.
  • Experience leveraging AI development tools such as GitHub Copilot or ChatGPT is highly desirable.
  • Strong analytical, troubleshooting, communication, and collaboration skills with experience working in Agile environments.

Responsibilities

  • Design, build, and maintain internal developer platforms, self-service infrastructure, and platform services using modern cloud-native technologies.
  • Develop and enhance automation solutions using Python, Bash, Go, and Infrastructure as Code tools such as Terraform, Pulumi, and Crossplane.
  • Collaborate with engineering teams to design reliable, scalable, and secure cloud infrastructure while supporting CI/CD pipelines and deployment strategies.
  • Monitor production environments, define and improve SLIs/SLOs, implement observability solutions, and strengthen monitoring and alerting capabilities.
  • Participate in incident response, troubleshoot production issues, conduct root cause analysis, and drive post-incident improvements.
  • Establish and maintain cloud governance, security standards, compliance initiatives, and cost optimization strategies.
  • Continuously reduce operational toil through automation and AI-assisted development practices while promoting Site Reliability Engineering principles across teams.