Site Reliability Engineer II - Platform Engineer

LivePerson | Customer Engagement
India (Remote) | Full-Time | Middle
Salary not disclosed

Job Details

Experience
5-8 years
Required Skills
AWS, Python, Agile, GCP, Kubernetes, SCRUM, Go, Grafana, Prometheus, Linux, Terraform, Datadog, Helm

Requirements

  • 5-8 years of experience as a Site Reliability Engineer, Platform Engineer, or DevOps Engineer.
  • Hands-on experience managing Kubernetes clusters (GKE, EKS) in GCP and AWS.
  • Strong knowledge of Terraform, Helm, and GitLab CI/CD pipelines.
  • Proficiency in Python, Go, or Shell scripting for automation and tooling.
  • Experience implementing and managing observability stacks (Prometheus, Grafana, Datadog).
  • Deep understanding of Linux systems, cloud networking, and container orchestration concepts.
  • Experience working in Agile/Scrum environments and partnering closely with developers.
  • Excellent analytical skills and a proactive attitude, with the ability to question assumptions and escalate potential risks early.
  • Experience with GitOps-based workflows using ArgoCD or Flux (Good to Have).
  • Familiarity with service mesh (Istio, Linkerd) or API gateways (Good to Have).
  • Knowledge of cloud cost optimization, autoscaling, or security best practices (Good to Have).
  • Experience with incident management tools such as PagerDuty or ServiceNow (Good to Have).

Responsibilities

  • Collaborate closely with Developers, QA, and Product teams during sprint planning to understand release plans, dependencies, and infrastructure requirements.
  • Participate in the application release cycle, ensuring deployments are automated, consistent, and reliable.
  • Manage and operate Kubernetes clusters in Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS).
  • Develop and manage Terraform modules for provisioning and configuring cloud infrastructure across GCP and AWS.
  • Standardize service deployments using Helm for templating and versioned releases.
  • Build and enhance observability with Prometheus, Grafana, and Datadog to monitor application and platform performance.
  • Design, implement, and maintain GitLab CI/CD pipelines for build, test, and deployment automation.
  • Drive an automation-first culture by developing scripts and tooling in Python, Go, or Shell to minimize manual effort and improve efficiency.
  • Participate in a 24/7 on-call rotation, ensuring quick detection, mitigation, and resolution of incidents.
  • Perform root cause analysis (RCA) and contribute to post-incident reviews to prevent recurrence.
  • Proactively identify reliability or scalability gaps, raise early warnings, and partner with teams to address systemic risks.