Senior Manager, Software Engineering (Infrastructure)

Canada (British Columbia, Ontario), London, India (Gujarat, Maharashtra, Bengaluru) | Full-Time | Senior
Salary not disclosed

Job Details

Experience
8+ years of experience in infrastructure, SRE, or cloud engineering roles, with 3+ years leading specialized engineering teams
Required Skills
AWS, Kubernetes, Machine Learning, DevOps, Terraform, LLM, MLOps

Requirements

  • 8+ years of experience in infrastructure, SRE, or cloud engineering roles
  • 3+ years leading specialized engineering teams
  • Extensive experience with AWS
  • Extensive experience with modern infrastructure-as-code (Terraform)
  • Proven track record of leading teams through production incidents and complex architectural migrations
  • Understanding of the infrastructure needs unique to machine learning
  • Proven expertise in managing large-scale containerized environments
  • Proven expertise in leveraging observability stacks to ensure platform health
  • Ability to align technical roadmaps with business objectives and advocate for infrastructure investment
  • Experience with FinOps or managing significant cloud budgets is a plus
  • Background in supporting AI agentic workflows or autonomous orchestration systems is a plus

Responsibilities

  • Lead and grow multiple teams across SRE, Cloud Infrastructure, and MLOps.
  • Coach and develop engineering managers and senior individual contributors, fostering a culture of ownership and high craft.
  • Build a "Platform-as-a-Product" mindset, ensuring that infrastructure and ML tooling serve as enablers for the rest of the engineering organization.
  • Own the operational health of production systems, including availability, latency, and durability.
  • Define and evolve SLIs, SLOs, and error budgets, moving the organization toward data-driven reliability decisions.
  • Lead incident response, driving blameless postmortems and systemic improvements.
  • Evolve Loopio’s cloud architecture, overseeing capacity planning, disaster recovery, and business continuity.
  • Drive the MLOps roadmap, establishing standards for model deployment, monitoring, and scaling.
  • Lead Cloud FinOps, ensuring our infrastructure and AI compute costs are visible, intentional, and optimized.
  • Partner with Security to ensure "secure-by-default" infrastructure and robust backup/recovery strategies.