Staff Software Developer, Production Engineering

Canada | Full-Time | Staff

Salary not disclosed

Job Details

Qualifications

8+ years of software engineering experience, including substantial exposure to platform engineering, infrastructure, site reliability engineering (SRE), or related disciplines.
Proven track record of improving reliability at scale through operational standards, automation, incident reduction initiatives, or platform-wide engineering improvements.
Strong expertise in backend systems, distributed architectures, and diagnosing complex production issues across interconnected services.
Experience conducting load testing, performance analysis, capacity planning, and translating technical findings into actionable engineering solutions.
Deep understanding of modern cloud-native technologies and deployment ecosystems, including Kubernetes, Helm, Argo, and related tooling.
Demonstrated ability to influence technical direction and drive adoption of best practices across teams without direct managerial authority.
Strong communication and stakeholder management skills, with the ability to present recommendations to both technical and senior leadership audiences.
Systems-thinking mindset with a focus on root-cause analysis, long-term problem prevention, and scalable engineering solutions.
Comfortable navigating ambiguity, balancing competing priorities, and driving outcomes in a fast-moving environment.
Interest in emerging technologies, including AI-assisted engineering and operational tooling.

Responsibilities

Design and implement platform-level reliability improvements, including guardrails, engineering standards, and best practices that reduce service failures and operational risk.
Develop and enhance tools that improve incident detection, response, mitigation, and recovery, including support for AI-assisted operational workflows.
Lead investigations into performance, scalability, and load-testing outcomes, transforming findings into measurable reliability improvements across critical systems.
Partner with platform and product engineering teams to review architectures, assess production readiness, and promote resilient engineering practices.
Identify recurring operational issues and design long-term solutions that eliminate root causes rather than addressing individual incidents.
Influence engineering teams through technical leadership, mentorship, and collaboration, driving adoption of reliability-focused standards across the organization.
Contribute to reliability planning, risk assessment discussions, and cross-functional initiatives aimed at improving uptime and user experience.
Support continuous improvement efforts by helping define operational metrics, incident prevention strategies, and engineering excellence initiatives.
