Senior Consultant Service Reliability Engineer
India, Full-Time, Senior
Salary not disclosed
Job Details
- Required Skills: AWS, Docker, Python, Bash, GCP, Kubernetes, Azure, Go, DevOps
Requirements
- Strong experience in Site Reliability Engineering, DevOps, infrastructure engineering, or related fields.
- Hands-on expertise with programming or scripting languages such as Python, Go, or Bash.
- Solid understanding of at least one major cloud platform, such as AWS, Azure, or GCP.
- Experience with observability and monitoring tools such as Grafana, Datadog, ELK Stack, Dynatrace, New Relic, or similar platforms.
- Familiarity with DevOps and GitOps methodologies and CI/CD practices.
- Strong knowledge of containerization and orchestration technologies such as Docker, Kubernetes, AWS EKS, or similar platforms.
- Understanding of microservices architecture, RESTful APIs, serverless systems, and modern cloud-native design patterns.
- Ability to troubleshoot complex infrastructure and production issues using logs, metrics, and monitoring data.
- Excellent communication, collaboration, and stakeholder management skills.
- Strong ownership mindset with the ability to work independently in high-pressure environments.
- Flexibility to participate in rotational and on-call support schedules.
Responsibilities
- Improve system reliability and resilience by implementing fault-tolerant architectures and automation strategies.
- Enhance monitoring, observability, and alerting systems to reduce operational overhead and improve incident detection and response times.
- Manage production incidents, coordinate communication with stakeholders, and conduct root cause analysis investigations.
- Collaborate with development teams to improve application reliability, scalability, and operational readiness.
- Integrate observability and automation practices into CI/CD pipelines and DevOps workflows.
- Monitor system performance and optimize infrastructure to meet SLAs and SLOs.
- Implement and maintain cloud-native infrastructure solutions aligned with reliability and security best practices.
- Drive continuous improvement initiatives including chaos engineering and proactive reliability testing.
- Build operational dashboards, metrics, and logging solutions to improve visibility across distributed systems.
- Support 24x7 operational needs through rotational or on-call responsibilities when required.