Senior Manager, Software Engineering (Infrastructure)
Canada (British Columbia, Ontario), London, India (Gujarat, Maharashtra, Bengaluru)
Full-Time
Senior
Salary not disclosed
Job Details
- Experience
- 8+ years of experience in infrastructure, SRE, or cloud engineering roles, with 3+ years leading specialized engineering teams
- Required Skills
- AWS, Kubernetes, Machine Learning, DevOps, Terraform, LLM, MLOps
Requirements
- 8+ years of experience in infrastructure, SRE, or cloud engineering roles
- 3+ years leading specialized engineering teams
- Extensive experience with AWS
- Extensive experience with modern infrastructure-as-code (Terraform)
- Proven track record of leading teams through production incidents and complex architectural migrations
- Understanding of the unique infrastructure needs of machine learning
- Proven expertise in managing large-scale containerized environments
- Proven expertise in leveraging observability stacks to ensure platform health
- Ability to align technical roadmaps with business objectives and advocate for infrastructure investment
- Experience with FinOps or managing significant cloud budgets is a plus
- Background in supporting AI agentic workflows or autonomous orchestration systems is a plus
Responsibilities
- Lead and grow multiple teams across SRE, Cloud Infrastructure, and MLOps.
- Coach and develop engineering managers and senior individual contributors, fostering a culture of ownership and high craft.
- Build a "Platform-as-a-Product" mindset, ensuring that infrastructure and ML tooling serve as enablers for the rest of the engineering organization.
- Own the operational health of production systems, including availability, latency, and durability.
- Define and evolve SLIs, SLOs, and error budgets, moving the organization toward data-driven reliability decisions.
- Lead incident response, driving blameless postmortems and systemic improvements.
- Evolve Loopio's cloud architecture, overseeing capacity planning, disaster recovery, and business continuity.
- Drive the MLOps roadmap, establishing standards for model deployment, monitoring, and scaling.
- Lead Cloud FinOps, ensuring our infrastructure and AI compute costs are visible, intentional, and optimized.
- Partner with Security to ensure "secure-by-default" infrastructure and robust backup/recovery strategies.