Staff Software Developer, Production Engineering

Canada, Full-Time, Staff
Salary not disclosed

Job Details

Experience
8+ years
Required Skills
Kubernetes, Software Engineering, Helm

Requirements

  • 8+ years of software engineering experience, including substantial exposure to platform engineering, infrastructure, site reliability engineering (SRE), or related disciplines.
  • Proven track record of improving reliability at scale through operational standards, automation, incident reduction initiatives, or platform-wide engineering improvements.
  • Strong expertise in backend systems, distributed architectures, and diagnosing complex production issues across interconnected services.
  • Experience conducting load testing, performance analysis, capacity planning, and translating technical findings into actionable engineering solutions.
  • Deep understanding of modern cloud-native technologies and deployment ecosystems, including Kubernetes, Helm, Argo, and related tooling.
  • Demonstrated ability to influence technical direction and drive adoption of best practices across teams without direct managerial authority.
  • Strong communication and stakeholder management skills, with the ability to present recommendations to both technical and senior leadership audiences.
  • Systems-thinking mindset with a focus on root-cause analysis, long-term problem prevention, and scalable engineering solutions.
  • Comfortable navigating ambiguity, balancing competing priorities, and driving outcomes in a fast-moving environment.
  • Interest in emerging technologies, including AI-assisted engineering and operational tooling.

Responsibilities

  • Design and implement platform-level reliability improvements, including guardrails, engineering standards, and best practices that reduce service failures and operational risk.
  • Develop and enhance tools that improve incident detection, response, mitigation, and recovery, including support for AI-assisted operational workflows.
  • Lead investigations into performance, scalability, and load-testing outcomes, transforming findings into measurable reliability improvements across critical systems.
  • Partner with platform and product engineering teams to review architectures, assess production readiness, and promote resilient engineering practices.
  • Identify recurring operational issues and design long-term solutions that eliminate root causes rather than addressing individual incidents.
  • Influence engineering teams through technical leadership, mentorship, and collaboration, driving adoption of reliability-focused standards across the organization.
  • Contribute to reliability planning, risk assessment discussions, and cross-functional initiatives aimed at improving uptime and user experience.
  • Support continuous improvement efforts by helping define operational metrics, incident prevention strategies, and engineering excellence initiatives.