SRE Leader

Posted 5 days ago

💎 Seniority level: Senior, 16 years

📍 Location: India

🏢 Company: Weekday AI | 👥 1-10 | 💰 over 3 years ago | E-Commerce, Fashion

⏳ Experience: 16 years

🪄 Skills: AWS, Docker, Leadership, Python, SQL, Bash, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, NoSQL, Analytical Skills, CI/CD, Problem Solving, DevOps, Terraform, Microservices, Networking, Ansible, Software Engineering

Requirements:
  • 8+ years of experience in Software Engineering, DevOps, or Site Reliability Engineering (SRE).
  • 3+ years of leadership experience, managing teams in an operational environment.
  • Expertise in cloud platforms such as AWS, GCP, or Azure.
  • Hands-on experience with Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible.
  • Proficiency in programming/scripting languages such as Python, Go, or Bash.
  • Strong experience with Kubernetes, Docker, and container orchestration.
  • In-depth knowledge of monitoring, logging, and observability tools like Prometheus, Grafana, ELK, or Datadog.
  • Expertise in CI/CD pipelines, automation, and deployment strategies.
  • Strong problem-solving and analytical skills, with a data-driven approach.
  • Excellent communication and leadership abilities to drive collaboration and innovation.

Responsibilities:
  • Lead and mentor a team of Site Reliability Engineers (SREs), fostering a culture of operational excellence and continuous improvement.
  • Develop and implement SRE best practices, including monitoring, alerting, and incident response strategies.
  • Design and build scalable, highly available, and resilient architectures to ensure system reliability.
  • Collaborate closely with engineering teams to optimize system performance, reliability, and capacity planning.
  • Drive automation initiatives to minimize manual tasks and enhance operational efficiency.
  • Define and enforce SLAs, SLOs, and error budgets to maintain the right balance between reliability and development velocity.
  • Lead incident management, root cause analysis, and post-mortem processes, ensuring continuous improvement.
  • Work with security teams to uphold compliance standards and implement best practices in infrastructure and operations.
  • Research, evaluate, and integrate new tools, technologies, and methodologies to enhance reliability and efficiency.