Cloud Reliability & Recovery Engineer

Remote - India | Full-Time | Senior

Salary not disclosed

Job Details

Experience: 5+ years in cloud infrastructure, SRE, or IT disaster recovery engineering roles; 3+ years of hands-on AWS experience in production environments at scale
Required Skills: AWS, Python, Bash, Kubernetes, CI/CD, Terraform, GitHub Actions, CloudFormation, HIPAA

Requirements

5+ years in cloud infrastructure, SRE, or IT disaster recovery engineering roles
3+ years of hands-on AWS experience in production environments at scale
Proven delivery of multi-region DR architectures with defined and tested RTO/RPO targets
Expert-level proficiency with core AWS resilience services
Strong scripting skills: Python, Bash, or PowerShell for automation and orchestration
Experience with Infrastructure as Code: Terraform and/or AWS CloudFormation
Solid understanding of networking fundamentals: VPC, Transit Gateway (TGW), Direct Connect, VPN, DNS failover
Excellent written and verbal communication; able to produce executive-level DR reports
AWS Certified Solutions Architect – Professional or AWS Certified DevOps Engineer – Professional (Preferred)
AWS Certified Advanced Networking – Specialty certification (Preferred)
Experience with AWS Resilience Hub for automated resilience assessments and policy enforcement (Preferred)
Familiarity with CloudEndure / AWS Elastic Disaster Recovery (DRS) for workload replication (Preferred)
Knowledge of Kubernetes-based DR (EKS multi-region, Velero backups, ArgoCD GitOps failover) (Preferred)
Hands-on experience with serverless DR patterns (Lambda, API Gateway, DynamoDB) (Preferred)

Responsibilities

Design and implement multi-region, multi-AZ AWS architectures that meet RTO/RPO targets
Engineer active-active and active-passive failover patterns using Route 53, Global Accelerator, and CloudFront
Build automated DR runbooks and playbooks using AWS Systems Manager Automation and Step Functions
Administer AWS Backup across all services (EC2, EBS, RDS, EFS, FSx, DynamoDB, Aurora) with policy-based automation
Author and maintain Terraform/CloudFormation templates for all BCP/DR infrastructure components
Automate DR testing pipelines through CI/CD (CodePipeline, CodeBuild, GitHub Actions)
Build CloudWatch dashboards, alarms, and composite alarms for availability and DR-readiness indicators
Participate in on-call rotations and lead DR incident response; conduct post-incident reviews (PIRs)
Conduct regular BCP/DR tabletop exercises and full failover simulations to validate recovery procedures and improve organizational readiness; document results and action items
Ensure DR controls meet SOC 2, ISO 22301, NIST 800-53, and HIPAA/PCI requirements as applicable
