Site Reliability Engineer II - Platform Engineering

India (Remote) | Full-Time | Middle
Salary not disclosed

Job Details

Experience
5-8 years
Required Skills
Python, Agile, Kubernetes, Scrum, Go, Grafana, Prometheus, Linux, Terraform, Datadog, Helm

Requirements

  • 5-8 years of experience as a Site Reliability Engineer, Platform Engineer, or DevOps Engineer
  • Hands-on experience managing Kubernetes clusters (GKE, EKS) in GCP and AWS
  • Strong knowledge of Terraform, Helm, and GitLab CI/CD pipelines
  • Proficiency in Python, Go, or Shell scripting for automation and tooling
  • Experience implementing and managing observability stacks (Prometheus, Grafana, Datadog)
  • Deep understanding of Linux systems, cloud networking, and container orchestration concepts
  • Experience working in Agile/Scrum environments
  • Excellent analytical skills with a proactive attitude

Responsibilities

  • Collaborate closely with Developers, QA, and Product teams during sprint planning to understand release plans, dependencies, and infrastructure requirements
  • Participate in the application release cycle, ensuring deployments are automated, consistent, and reliable
  • Manage and operate Kubernetes clusters in Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS)
  • Develop and manage Terraform modules for provisioning and configuring cloud infrastructure across GCP and AWS
  • Standardize service deployments using Helm for templating and versioned releases
  • Build and enhance observability with Prometheus, Grafana, and Datadog to monitor application and platform performance
  • Design, implement, and maintain GitLab CI/CD pipelines for build, test, and deployment automation
  • Drive an automation-first culture by developing scripts and tooling in Python, Go, or Shell to minimize manual effort and improve efficiency
  • Participate in a 24/7 on-call rotation, ensuring quick detection, mitigation, and resolution of incidents
  • Perform root cause analysis (RCA) and contribute to post-incident reviews to prevent recurrence
  • Proactively identify reliability or scalability gaps, raise early warnings, and partner with teams to address systemic risks