Lead Site Reliability Engineer

Juniper Square FinTech

India, 27 U.S. states, 2 Canadian Provinces, Luxembourg, and England

Full-Time

Lead

Salary not disclosed

Job Details

Languages: English
Experience: 7-10 years
Required Skills: AWS, PostgreSQL, Python, Artificial Intelligence, Kubernetes, Go, CI/CD, Linux, Terraform, Microservices, CloudFormation, LLM

Requirements

7-10 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering in a production cloud environment.
5+ years of hands-on experience with AWS cloud services across compute, networking, storage, and security.
5+ years managing Linux-based production environments at scale.
5+ years using Infrastructure-as-Code (Terraform, CDK, CloudFormation) and/or GitOps best practices.
3+ years operating and troubleshooting production Kubernetes environments.
3+ years applying AWS Well-Architected Framework principles across reliability, security, performance, and cost pillars.
3+ years in cloud security best practices including IAM, secrets management, network security, and compliance.
3+ years working with PostgreSQL in production: performance tuning, replication, backup, and recovery.
Demonstrated track record of leading multi-person technical projects from scoping through delivery.
Strong general programming skills; comfort writing automation scripts and tooling in Python, Go, or similar.
Deep knowledge of observability tooling — metrics, logging, distributed tracing — and how to use them to drive reliability.
Solid understanding of data retention, backup, and recovery processes across cloud-native systems.
Experience with CI/CD pipelines, release management, and deployment automation.
Familiarity with service mesh, API gateway patterns, and microservices architectures.
Experience using AI-assisted workflows across the SDLC, with an emphasis on production reliability, operability, and maintainability of large-scale systems.
Hands-on experience integrating LLMs or AI systems into production environments, with a focus on reliability, latency, observability, and failure handling.

Responsibilities

Own and drive the technical direction for your team's infrastructure systems, making architectural decisions that balance reliability, scalability, and cost.
Design systems of moderate to high complexity using distributed systems best practices; anticipate future use cases and minimize technical debt.
Conduct architectural reviews and advance design patterns across the organization.
Own the reliability posture of team-owned services — establish SLOs, monitor SLAs, and hold the team accountable to them.
Lead incident response for complex, multi-service issues; systematically debug, identify root causes, and ensure issues do not recur.
Act as DRI (Directly Responsible Individual) for medium-to-large SRE projects spanning months and involving cross-team collaboration.
Partner with Engineering Managers and Product Managers to scope roadmap initiatives, break down work into actionable increments, and commit to delivery plans.
Participate in and help lead the on-call rotation; ensure production systems are appropriately instrumented.
Define and enforce security best practices across team-owned systems; proactively surface gaps to senior leadership.
