SRE Leader

Posted 5 days ago

💎 Seniority level: Senior, 16 years

📍 Location: India

🏢 Company: Weekday AI | 👥 1-10 | 💰 over 3 years ago | E-Commerce Fashion

⏳ Experience: 16 years

🪄 Skills: AWS, Docker, Leadership, Python, SQL, Bash, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, NoSQL, Analytical Skills, CI/CD, Problem Solving, DevOps, Terraform, Microservices, Networking, Ansible, Software Engineering

Requirements:

8+ years of experience in Software Engineering, DevOps, or Site Reliability Engineering (SRE).
3+ years of leadership experience, managing teams in an operational environment.
Expertise in cloud platforms such as AWS, GCP, or Azure.
Hands-on experience with Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible.
Proficiency in programming/scripting languages such as Python, Go, or Bash.
Strong experience with Kubernetes, Docker, and container orchestration.
In-depth knowledge of monitoring, logging, and observability tools like Prometheus, Grafana, ELK, or Datadog.
Expertise in CI/CD pipelines, automation, and deployment strategies.
Strong problem-solving and analytical skills, with a data-driven approach.
Excellent communication and leadership abilities to drive collaboration and innovation.

Responsibilities:

Lead and mentor a team of Site Reliability Engineers (SREs), fostering a culture of operational excellence and continuous improvement.
Develop and implement SRE best practices, including monitoring, alerting, and incident response strategies.
Design and build scalable, highly available, and resilient architectures to ensure system reliability.
Collaborate closely with engineering teams to optimize system performance, reliability, and capacity planning.
Drive automation initiatives to minimize manual tasks and enhance operational efficiency.
Define and enforce SLAs, SLOs, and error budgets to maintain the right balance between reliability and development velocity.
Lead incident management, root cause analysis, and post-mortem processes, ensuring continuous improvement.
Work with security teams to uphold compliance standards and implement best practices in infrastructure and operations.
Research, evaluate, and integrate new tools, technologies, and methodologies to enhance reliability and efficiency.
