Lead Site Reliability Engineer

India, 27 U.S. states, 2 Canadian provinces, Luxembourg, and England
Full-Time
Lead
Salary not disclosed

Job Details

Languages
English
Experience
7-10 years
Required Skills
AWS, PostgreSQL, Python, Artificial Intelligence, Kubernetes, Go, CI/CD, Linux, Terraform, Microservices, CloudFormation, LLM

Requirements

  • 7-10 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering in a production cloud environment.
  • 5+ years of hands-on experience with AWS cloud services across compute, networking, storage, and security.
  • 5+ years managing Linux-based production environments at scale.
  • 5+ years using Infrastructure-as-Code (Terraform, CDK, CloudFormation) and/or GitOps best practices.
  • 3+ years operating and troubleshooting production Kubernetes environments.
  • 3+ years applying AWS Well-Architected Framework principles across reliability, security, performance, and cost pillars.
  • 3+ years in cloud security best practices including IAM, secrets management, network security, and compliance.
  • 3+ years working with PostgreSQL in production: performance tuning, replication, backup, and recovery.
  • Demonstrated track record of leading multi-person technical projects from scoping through delivery.
  • Strong general programming skills; comfort writing automation scripts and tooling in Python, Go, or similar.
  • Deep knowledge of observability tooling — metrics, logging, distributed tracing — and how to use them to drive reliability.
  • Solid understanding of data retention, backup, and recovery processes across cloud-native systems.
  • Experience with CI/CD pipelines, release management, and deployment automation.
  • Familiarity with service mesh, API gateway patterns, and microservices architectures.
  • Experience using AI-assisted workflows across the SDLC, with an emphasis on production reliability, operability, and maintainability of large-scale systems.
  • Hands-on experience integrating LLMs or AI systems into production environments, with a focus on reliability, latency, observability, and failure handling.

Responsibilities

  • Own and drive the technical direction for your team's infrastructure systems, making architectural decisions that balance reliability, scalability, and cost.
  • Design systems of moderate to high complexity using distributed systems best practices; anticipate future use cases and minimize technical debt.
  • Conduct architectural reviews and advance design patterns across the organization.
  • Own the reliability posture of team-owned services — establish SLOs, monitor SLAs, and hold the team accountable to them.
  • Lead incident response for complex, multi-service issues; systematically debug, identify root causes, and ensure issues do not recur.
  • Act as DRI (Directly Responsible Individual) for medium-to-large SRE projects spanning months and involving cross-team collaboration.
  • Partner with Engineering Managers and Product Managers to scope roadmap initiatives, break down work into actionable increments, and commit to delivery plans.
  • Participate in and help lead the on-call rotation; ensure production systems are appropriately instrumented.
  • Define and enforce security best practices across team-owned systems; proactively surface gaps to senior leadership.