Staff Software Developer, Production Engineering
Canada, Full-Time, Staff
Salary not disclosed
Job Details
- Experience: 8+ years
- Required Skills: Kubernetes, Software Engineering, Helm
Requirements
- 8+ years of software engineering experience, including substantial exposure to platform engineering, infrastructure, site reliability engineering (SRE), or related disciplines.
- Proven track record of improving reliability at scale through operational standards, automation, incident reduction initiatives, or platform-wide engineering improvements.
- Strong expertise in backend systems, distributed architectures, and diagnosing complex production issues across interconnected services.
- Experience conducting load testing, performance analysis, capacity planning, and translating technical findings into actionable engineering solutions.
- Deep understanding of modern cloud-native technologies and deployment ecosystems, including Kubernetes, Helm, Argo, and related tooling.
- Demonstrated ability to influence technical direction and drive adoption of best practices across teams without direct managerial authority.
- Strong communication and stakeholder management skills, with the ability to present recommendations to both technical and senior leadership audiences.
- Systems-thinking mindset with a focus on root-cause analysis, long-term problem prevention, and scalable engineering solutions.
- Comfortable navigating ambiguity, balancing competing priorities, and driving outcomes in a fast-moving environment.
- Interest in emerging technologies, including AI-assisted engineering and operational tooling.
Responsibilities
- Design and implement platform-level reliability improvements, including guardrails, engineering standards, and best practices that reduce service failures and operational risk.
- Develop and enhance tools that improve incident detection, response, mitigation, and recovery, including support for AI-assisted operational workflows.
- Lead investigations into performance, scalability, and load-testing outcomes, transforming findings into measurable reliability improvements across critical systems.
- Partner with platform and product engineering teams to review architectures, assess production readiness, and promote resilient engineering practices.
- Identify recurring operational issues and design long-term solutions that eliminate root causes rather than merely address individual incidents.
- Influence engineering teams through technical leadership, mentorship, and collaboration, driving adoption of reliability-focused standards across the organization.
- Contribute to reliability planning, risk assessment discussions, and cross-functional initiatives aimed at improving uptime and user experience.
- Support continuous improvement efforts by helping define operational metrics, incident prevention strategies, and engineering excellence initiatives.