Senior Site Reliability Engineer - AI Platform

Canada | Full-Time | Senior

Salary: 125,200 - 132,500 CAD per year

Job Details

Experience: 6–8+ years
Required Skills: AWS, Docker, Python, Bash, Kubernetes, Terraform, GitHub Actions, Datadog

Qualifications

6–8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering, with senior-level technical ownership responsibilities.
Deep expertise in AWS and distributed systems architecture, including multi-region, high-availability environments.
Highly skilled in Kubernetes, Docker, Terraform, and GitOps practices, with strong infrastructure-as-code experience.
Hands-on experience with observability platforms such as Datadog, including SLO monitoring, alerting, tracing, and log analytics.
Proficient in scripting and development (Python and/or Bash), with a solid understanding of microservices architectures.
Strong experience designing and optimizing CI/CD pipelines (e.g., GitHub Actions, Bitbucket Pipelines).
Understanding of reliability challenges in large-scale systems, with the ability to translate complex technical risks into actionable engineering solutions.
Strong communication and collaboration skills, with the ability to influence cross-functional teams and mentor engineers.
Experience with AI/ML infrastructure, LLM systems, or agent-based architectures is a strong advantage.

Responsibilities

Define and own service reliability standards, including SLOs, SLIs, and error budgets, ensuring consistent performance across all production systems.
Design and implement reliability patterns for AI agent pipelines, including observability, failure detection, and safe degradation mechanisms.
Architect and improve multi-region infrastructure strategies, driving high availability, disaster recovery readiness, and blast radius control.
Lead incident response and postmortem processes, ensuring durable fixes and continuous improvement of system resilience.
Serve as the primary reliability partner for engineering and AI teams, influencing architecture, deployment strategies, and system design decisions.
Own observability and platform tooling, including service catalog management, Datadog configuration, and AI workload monitoring.
Develop CI/CD standards and enable self-service developer platforms to improve deployment velocity and system reliability.
Contribute to FinOps initiatives by improving cost visibility and optimizing infrastructure efficiency across cloud environments.