Lead Site Reliability Engineer
Juniper Square FinTech
India, 27 U.S. states, 2 Canadian provinces, Luxembourg, and England
Full-Time
Lead
Salary not disclosed
Job Details
- Languages
- English
- Experience
- 7-10 years
- Required Skills
- AWS, PostgreSQL, Python, Artificial Intelligence, Kubernetes, Go, CI/CD, Linux, Terraform, Microservices, CloudFormation, LLM
Requirements
- 7-10 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering in a production cloud environment.
- 5+ years of hands-on experience with AWS cloud services across compute, networking, storage, and security.
- 5+ years managing Linux-oriented production environments at scale.
- 5+ years using Infrastructure-as-Code (Terraform, CDK, CloudFormation) and/or GitOps best practices.
- 3+ years operating and troubleshooting production Kubernetes environments.
- 3+ years applying AWS Well-Architected Framework principles across reliability, security, performance, and cost pillars.
- 3+ years in cloud security best practices including IAM, secrets management, network security, and compliance.
- 3+ years working with PostgreSQL in production: performance tuning, replication, backup, and recovery.
- Demonstrated track record of leading multi-person technical projects from scoping through delivery.
- Strong general programming skills; comfort writing automation scripts and tooling in Python, Go, or similar.
- Deep knowledge of observability tooling — metrics, logging, distributed tracing — and how to use them to drive reliability.
- Solid understanding of data retention, backup, and recovery processes across cloud-native systems.
- Experience with CI/CD pipelines, release management, and deployment automation.
- Familiarity with service mesh, API gateway patterns, and microservices architectures.
- Experience using AI-assisted workflows across the SDLC, with an emphasis on production reliability, operability, and maintainability of large-scale systems.
- Hands-on experience integrating LLMs or AI systems into production environments, with a focus on reliability, latency, observability, and failure handling.
Responsibilities
- Own and drive the technical direction for your team's infrastructure systems, making architectural decisions that balance reliability, scalability, and cost.
- Design systems of moderate to high complexity using distributed systems best practices; anticipate future use cases and minimize technical debt.
- Conduct architectural reviews and advance design patterns across the organization.
- Own the reliability posture of team-owned services — establish SLOs, monitor SLAs, and hold the team accountable to them.
- Lead incident response for complex, multi-service issues; systematically debug, identify root causes, and ensure issues do not recur.
- Act as DRI (Directly Responsible Individual) for medium-to-large SRE projects spanning months and involving cross-team collaboration.
- Partner with Engineering Managers and Product Managers to scope roadmap initiatives, break down work into actionable increments, and commit to delivery plans.
- Participate in and help lead the on-call rotation; ensure production systems are appropriately instrumented.
- Define and enforce security best practices across team-owned systems; proactively surface gaps to senior leadership.