Senior Site Reliability Engineer - AI Platform

Canada, Full-Time, Senior
Salary: 125,200 - 132,500 CAD per year

Job Details

Experience
6–8+ years
Required Skills
AWS, Docker, Python, Bash, Kubernetes, Terraform, GitHub Actions, Datadog

Requirements

  • 6–8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering, with senior-level technical ownership responsibilities.
  • Deep expertise in AWS and distributed systems architecture, including multi-region, high-availability environments.
  • Highly skilled in Kubernetes, Docker, Terraform, and GitOps practices, with strong infrastructure-as-code experience.
  • Hands-on experience with observability platforms such as Datadog, including SLO monitoring, alerting, tracing, and log analytics.
  • Proficient in scripting and development (Python and/or Bash), with a solid understanding of microservices architectures.
  • Strong experience designing and optimizing CI/CD pipelines (e.g., GitHub Actions, Bitbucket Pipelines).
  • Solid understanding of reliability challenges in large-scale systems, with the ability to translate complex technical risks into actionable engineering solutions.
  • Strong communication and collaboration skills, with the ability to influence cross-functional teams and mentor engineers.
  • Experience with AI/ML infrastructure, LLM systems, or agent-based architectures is a strong advantage.

Responsibilities

  • Define and own service reliability standards, including SLOs, SLIs, and error budgets, ensuring consistent performance across all production systems.
  • Design and implement reliability patterns for AI agent pipelines, including observability, failure detection, and safe degradation mechanisms.
  • Architect and improve multi-region infrastructure strategies, driving high availability, disaster recovery readiness, and blast radius control.
  • Lead incident response and postmortem processes, ensuring durable fixes and continuous improvement of system resilience.
  • Serve as the primary reliability partner for engineering and AI teams, influencing architecture, deployment strategies, and system design decisions.
  • Own observability and platform tooling, including service catalog management, Datadog configuration, and AI workload monitoring.
  • Develop CI/CD standards and enable self-service developer platforms to improve deployment velocity and system reliability.
  • Contribute to FinOps initiatives by improving cost visibility and optimizing infrastructure efficiency across cloud environments.