Site Reliability Engineer II - Platform Engineer
LivePerson | Customer Engagement
India (Remote) | Full-Time | Middle
Salary not disclosed
Job Details
- Experience: 5-8 years
- Required Skills: AWS, Python, Agile, GCP, Kubernetes, Scrum, Go, Grafana, Prometheus, Linux, Terraform, Datadog, Helm
Requirements
- 5-8 years of experience as a Site Reliability Engineer, Platform Engineer, or DevOps Engineer.
- Hands-on experience managing Kubernetes clusters (GKE, EKS) in GCP and AWS.
- Strong knowledge of Terraform, Helm, and GitLab CI/CD pipelines.
- Proficiency in Python, Go, or Shell scripting for automation and tooling.
- Experience implementing and managing observability stacks (Prometheus, Grafana, Datadog).
- Deep understanding of Linux systems, cloud networking, and container orchestration concepts.
- Experience working in Agile/Scrum environments and partnering closely with developers.
- Excellent analytical skills and a proactive attitude, with the ability to question assumptions and escalate potential risks early.
- Experience with ArgoCD or Flux (GitOps-based workflows) (Good to Have).
- Familiarity with service mesh (Istio, Linkerd) or API gateways (Good to Have).
- Knowledge of cloud cost optimization, autoscaling, or security best practices (Good to Have).
- Experience with incident management tools such as PagerDuty or ServiceNow (Good to Have).
Responsibilities
- Collaborate closely with Developers, QA, and Product teams during sprint planning to understand release plans, dependencies, and infrastructure requirements.
- Participate in the application release cycle, ensuring deployments are automated, consistent, and reliable.
- Manage and operate Kubernetes clusters in Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS).
- Develop and manage Terraform modules for provisioning and configuring cloud infrastructure across GCP and AWS.
- Standardize service deployments using Helm for templating and versioned releases.
- Build and enhance observability with Prometheus, Grafana, and Datadog to monitor application and platform performance.
- Design, implement, and maintain GitLab CI/CD pipelines for build, test, and deployment automation.
- Drive an automation-first culture by developing scripts and tooling in Python, Go, or Shell to minimize manual effort and improve efficiency.
- Participate in a 24/7 on-call rotation, ensuring quick detection, mitigation, and resolution of incidents.
- Perform root cause analysis (RCA) and contribute to post-incident reviews to prevent recurrence.
- Proactively identify reliability or scalability gaps, raise early warnings, and partner with teams to address systemic risks.