Senior Consultant Service Reliability Engineer

India, Full-Time, Senior
Salary not disclosed

Job Details

Required Skills
AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Go, DevOps

Requirements

  • Strong experience in Site Reliability Engineering, DevOps, infrastructure engineering, or related fields.
  • Hands-on expertise with programming or scripting languages such as Python, Go, or Bash.
  • Solid understanding of at least one major cloud platform, such as AWS, Azure, or GCP.
  • Experience with observability and monitoring tools such as Grafana, Datadog, ELK Stack, Dynatrace, New Relic, or similar platforms.
  • Familiarity with DevOps and GitOps methodologies and CI/CD practices.
  • Strong knowledge of containerization and orchestration technologies such as Docker, Kubernetes, AWS EKS, or similar platforms.
  • Understanding of microservices architecture, RESTful APIs, serverless systems, and modern cloud-native design patterns.
  • Ability to troubleshoot complex infrastructure and production issues using logs, metrics, and monitoring data.
  • Excellent communication, collaboration, and stakeholder management skills.
  • Strong ownership mindset with the ability to work independently in high-pressure environments.
  • Flexibility to participate in rotational and on-call support schedules.

Responsibilities

  • Improve system reliability and resilience by implementing fault-tolerant architectures and automation strategies.
  • Enhance monitoring, observability, and alerting systems to reduce operational overhead and improve incident detection and response times.
  • Manage production incidents, coordinate communication with stakeholders, and conduct root cause analysis investigations.
  • Collaborate with development teams to improve application reliability, scalability, and operational readiness.
  • Integrate observability and automation practices into CI/CD pipelines and DevOps workflows.
  • Monitor system performance and optimize infrastructure to meet SLAs and SLOs.
  • Implement and maintain cloud-native infrastructure solutions aligned with reliability and security best practices.
  • Drive continuous improvement initiatives such as chaos engineering and proactive reliability testing.
  • Build operational dashboards, metrics, and logging solutions to improve visibility across distributed systems.
  • Support 24x7 operational needs through rotational or on-call responsibilities when required.