Site Reliability Engineer

India, Full-Time, Middle
Salary not disclosed

Job Details

Required Skills
Docker, Python, GCP, Kubernetes, Go, Grafana, Prometheus, Linux, Terraform

Requirements

  • Strong hands-on experience with cloud platforms, particularly Google Cloud, and infrastructure-as-code tools such as Terraform.
  • Solid understanding of microservices architectures, containerization, and distributed systems, including production use of Kubernetes and Docker.
  • Strong SRE mindset focused on automation, scalability, observability, and reliability engineering principles.
  • Practical experience in Linux system administration, networking fundamentals, and security concepts such as PKI and secure service-to-service communication.
  • Strong problem-solving skills, ability to work in high-pressure environments, and comfort with incident management and operational ownership.

Responsibilities

  • Operate and optimize containerized environments using Kubernetes and service mesh technologies such as Istio, ensuring high availability and performance across distributed systems.
  • Build automation and operational tooling using Go, Python, and Shell scripting to reduce manual intervention and improve system efficiency.
  • Design and maintain observability stacks using Prometheus, Grafana, and Loki for proactive incident detection and resolution.
  • Troubleshoot and resolve complex issues across networking, storage, and system performance layers in large-scale distributed environments.
  • Participate in on-call rotations, incident response, and postmortem analysis to continuously improve reliability and operational maturity.
  • Collaborate with AI/ML and data engineering teams to ensure infrastructure readiness for model training, inference workloads, and data pipelines.