Site Reliability Engineer
India | Full-Time | Middle
Salary not disclosed
Job Details
- Required Skills: Docker, Python, GCP, Kubernetes, Go, Grafana, Prometheus, Linux, Terraform
Requirements
- Strong hands-on experience with cloud platforms, particularly Google Cloud, and infrastructure-as-code tools such as Terraform.
- Solid understanding of microservices architectures, containerization, and distributed systems, including production use of Kubernetes and Docker.
- Strong SRE mindset focused on automation, scalability, observability, and reliability engineering principles.
- Practical experience in Linux system administration, networking fundamentals, and security concepts such as PKI and secure service-to-service communication.
- Strong problem-solving skills, ability to work in high-pressure environments, and comfort with incident management and operational ownership.
Responsibilities
- Operate and optimize containerized environments using Kubernetes and service mesh technologies such as Istio, ensuring high availability and performance across distributed systems.
- Build automation and operational tooling using Go, Python, and Shell scripting to reduce manual intervention and improve system efficiency.
- Design and maintain observability stacks using Prometheus, Grafana, and Loki for proactive incident detection and resolution.
- Troubleshoot and resolve complex issues across networking, storage, and system performance layers in large-scale distributed environments.
- Participate in on-call rotations, incident response, and postmortem analysis to continuously improve reliability and operational maturity.
- Collaborate with AI/ML and data engineering teams to ensure infrastructure readiness for model training, inference workloads, and data pipelines.