Cloud Reliability & Recovery Engineer
Remote - India | Full-Time | Senior
Salary not disclosed
Job Details
- Experience
- 5+ years in cloud infrastructure, SRE, or IT disaster recovery engineering roles; 3+ years of hands-on AWS experience in production environments at scale
- Required Skills
- AWS, Python, Bash, Kubernetes, CI/CD, Terraform, GitHub Actions, CloudFormation, HIPAA
Requirements
- 5+ years in cloud infrastructure, SRE, or IT disaster recovery engineering roles
- 3+ years of hands-on AWS experience in production environments at scale
- Proven delivery of multi-region DR architectures with defined and tested RTO/RPO targets
- Expert-level proficiency with core AWS resilience services
- Strong scripting skills: Python, Bash, or PowerShell for automation and orchestration
- Experience with Infrastructure as Code: Terraform and/or AWS CloudFormation
- Solid understanding of networking fundamentals: VPC, Transit Gateway (TGW), Direct Connect, VPN, DNS failover
- Excellent written and verbal communication; able to produce executive-level DR reports
- AWS Certified Solutions Architect – Professional or AWS Certified DevOps Engineer – Professional (Preferred)
- AWS Certified Advanced Networking – Specialty certification (Preferred)
- Experience with AWS Resilience Hub for automated resilience assessments and policy enforcement (Preferred)
- Familiarity with CloudEndure / AWS Elastic Disaster Recovery (DRS) for workload replication (Preferred)
- Knowledge of Kubernetes-based DR (EKS multi-region, Velero backups, ArgoCD GitOps failover) (Preferred)
- Hands-on experience with serverless DR patterns (Lambda, API Gateway, DynamoDB) (Preferred)
Responsibilities
- Design and implement multi-region, multi-AZ AWS architectures that meet RTO/RPO targets
- Engineer active-active and active-passive failover patterns using Route 53, Global Accelerator, and CloudFront
- Build automated DR runbooks and playbooks using AWS Systems Manager Automation and Step Functions
- Administer AWS Backup across all services (EC2, EBS, RDS, EFS, FSx, DynamoDB, Aurora) with policy-based automation
- Author and maintain Terraform/CloudFormation templates for all BCP/DR infrastructure components
- Automate DR testing pipelines through CI/CD (CodePipeline, CodeBuild, GitHub Actions)
- Build CloudWatch dashboards, alarms, and composite alarms for availability and DR-readiness indicators
- Participate in on-call rotations and lead DR incident response; conduct post-incident reviews (PIRs)
- Conduct regular BCP/DR tabletop exercises and full failover simulations to validate recovery procedures and improve organizational readiness; document results and action items
- Ensure DR controls meet SOC 2, ISO 22301, NIST 800-53, and HIPAA/PCI requirements as applicable