Senior Site Reliability Engineer
Remote - Canada / Remote - Pacific Northwest Area | Full-Time | Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: AWS, Python, Bash, Kubernetes, Grafana, Prometheus, CI/CD, Terraform
Requirements
- 5+ years in SRE, DevOps, or infrastructure engineering roles.
- Demonstrated track record of operating production systems across multiple regions.
- Deep expertise in AWS services: VPC, IAM, EKS, S3, and CloudWatch.
- Advanced Kubernetes skills: cluster operations, autoscaling, RBAC, and Helm.
- Proficiency with Terraform modules, state management, and multi-environment patterns.
- Experience with CI/CD and GitOps workflows (e.g., GitHub Actions, ArgoCD).
- Strong understanding of networking fundamentals: CIDR, DNS, load balancing, and VPN.
- Familiarity with observability tools such as Prometheus and Grafana.
- Comfort with Python and Bash for tooling and automation.
- Working knowledge of both Linux and Windows environments.
Responsibilities
- Design, build, and maintain multi-region AWS infrastructure using Terraform.
- Operate and scale EKS clusters, including autoscaling, node lifecycle, and workload health management.
- Manage network infrastructure including VPC design, DNS, load balancing, and cross-region connectivity.
- Improve GitOps-based deployment workflows using GitHub Actions, Helm, and Kustomize.
- Lead incident management, including debugging, root-cause analysis, and postmortems.
- Enhance observability using metrics, logging, tracing, and dashboards.
- Govern cloud IAM policies and roles across services.
- Develop internal platform tooling and automation using Python and Bash.