Site Reliability Engineer II - Platform Engineer

LivePerson | Customer Engagement

India (Remote) | Full-Time | Middle

Salary not disclosed


Job Details

Experience: 5-8 years
Required Skills: AWS, Python, Agile, GCP, Kubernetes, Scrum, Go, Grafana, Prometheus, Linux, Terraform, Datadog, Helm

Requirements

5-8 years of experience as a Site Reliability Engineer, Platform Engineer, or DevOps Engineer.
Hands-on experience managing Kubernetes clusters (GKE, EKS) in GCP and AWS.
Strong knowledge of Terraform, Helm, and GitLab CI/CD pipelines.
Proficiency in Python, Go, or Shell scripting for automation and tooling.
Experience implementing and managing observability stacks (Prometheus, Grafana, Datadog).
Deep understanding of Linux systems, cloud networking, and container orchestration concepts.
Experience working in Agile/Scrum environments and partnering closely with developers.
Excellent analytical skills and a proactive attitude: able to question assumptions and escalate potential risks early.
Experience with ArgoCD or Flux (GitOps-based workflows) (Good to Have).
Familiarity with service mesh (Istio, Linkerd) or API gateways (Good to Have).
Knowledge of cloud cost optimization, autoscaling, or security best practices (Good to Have).
Experience with incident management tools such as PagerDuty or ServiceNow (Good to Have).

Responsibilities

Collaborate closely with Developers, QA, and Product teams during sprint planning to understand release plans, dependencies, and infrastructure requirements.
Participate in the application release cycle, ensuring deployments are automated, consistent, and reliable.
Manage and operate Kubernetes clusters in Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS).
Develop and manage Terraform modules for provisioning and configuring cloud infrastructure across GCP and AWS.
Standardize service deployments using Helm for templating and versioned releases.
Build and enhance observability with Prometheus, Grafana, and Datadog to monitor application and platform performance.
Design, implement, and maintain GitLab CI/CD pipelines for build, test, and deployment automation.
Drive an automation-first culture by developing scripts and tooling in Python, Go, or Shell to minimize manual effort and improve efficiency.
Participate in a 24/7 on-call rotation, ensuring quick detection, mitigation, and resolution of incidents.
Perform root cause analysis (RCA) and contribute to post-incident reviews to prevent recurrence.
Proactively identify reliability or scalability gaps, raise early warnings, and partner with teams to address systemic risks.
