Site Reliability Engineer II - Platform Engineering
India (Remote) | Full-Time | Middle
Salary not disclosed
Job Details
- Experience: 5-8 years
- Required Skills: Python, Agile, Kubernetes, Scrum, Go, Grafana, Prometheus, Linux, Terraform, Datadog, Helm
Requirements
- 5-8 years of experience as a Site Reliability Engineer, Platform Engineer, or DevOps Engineer
- Hands-on experience managing Kubernetes clusters (GKE, EKS) in GCP and AWS
- Strong knowledge of Terraform, Helm, and GitLab CI/CD pipelines
- Proficiency in Python, Go, or Shell scripting for automation and tooling
- Experience implementing and managing observability stacks (Prometheus, Grafana, Datadog)
- Deep understanding of Linux systems, cloud networking, and container orchestration concepts
- Experience working in Agile/Scrum environments
- Excellent analytical skills and a proactive attitude
Responsibilities
- Collaborate closely with Developers, QA, and Product teams during sprint planning to understand release plans, dependencies, and infrastructure requirements
- Participate in the application release cycle, ensuring deployments are automated, consistent, and reliable
- Manage and operate Kubernetes clusters in Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS)
- Develop and manage Terraform modules for provisioning and configuring cloud infrastructure across GCP and AWS
- Standardize service deployments using Helm for templating and versioned releases
- Build and enhance observability with Prometheus, Grafana, and Datadog to monitor application and platform performance
- Design, implement, and maintain GitLab CI/CD pipelines for build, test, and deployment automation
- Drive an automation-first culture by developing scripts and tooling in Python, Go, or Shell to minimize manual effort and improve efficiency
- Participate in a 24/7 on-call rotation, ensuring quick detection, mitigation, and resolution of incidents
- Perform root cause analysis (RCA) and contribute to post-incident reviews to prevent recurrence
- Proactively identify reliability or scalability gaps, raise early warnings, and partner with teams to address systemic risks