Senior Site Reliability Engineer
Canada and the US Pacific Northwest | Full-Time | Senior
Salary: 145,000 - 185,000 CAD per year
Job Details
- Experience: 5+ years
- Required Skills: AWS, Docker, Python, Bash, Kubernetes, Grafana, Prometheus, Linux, Terraform, GitHub Actions, Helm
Requirements
- 5+ years in SRE, DevOps, or infrastructure engineering roles
- Track record of operating production systems across multiple regions
- Terraform: Modules, state management, and multi-environment patterns
- AWS depth: VPC, IAM, EKS, S3, and CloudWatch
- Kubernetes expertise: Cluster operations, autoscaling, RBAC, and Helm
- CI/CD and GitOps: GitHub Actions, ArgoCD, or similar workflows
- Networking fundamentals: CIDR, DNS, load balancing, VPN, and cross-region connectivity
- Observability: Prometheus and Grafana
- Scripting: Python and Bash for tooling and automation
- Cross-platform familiarity: Working knowledge of both Linux and Windows environments
- Operational experience supporting Windows-based workloads
- Comfortable in a fast-moving startup with evolving priorities
- Take ownership of systems while collaborating closely with other teams
- Pragmatic about tradeoffs between speed, reliability, and complexity
Responsibilities
- Design, build, and maintain multi-region AWS infrastructure using Terraform
- Operate and scale EKS clusters across production regions: autoscaling, node lifecycle, workload health
- Manage networking across environments: VPC design, DNS, load balancing, and cross-region connectivity
- Support infrastructure changes, migrations, and expansions into new regions
- Contribute to and improve GitOps-based deployment workflows using GitHub Actions, Helm, and Kustomize
- Help build and run incident management processes: severity definitions, escalation paths, on-call practices
- Lead incident response, debugging, and root-cause analysis
- Write postmortems and drive systemic reliability improvements
- Improve observability across metrics, logging, tracing, and dashboards
- Support GPU and batch workloads running on Kubernetes
- Provide security-conscious feedback on platform architecture decisions
- Own cloud IAM governance: roles, policies, and access boundaries across accounts and services
- Lead compliance-adjacent work, including audit readiness, partner certification requirements, and support for responses to customer security questionnaires
- Improve CI/CD pipelines and infrastructure validation
- Support engineers with infrastructure debugging, environment setup, and performance issues
- Contribute to tooling and automation in Python and Bash