Staff Distributed Systems Engineer - Collaboration
Remote-first (United States; BC & ON, Canada) | Full-Time | Staff
Salary: 164,000 USD - 328,000 CAD per year
Job Details
- Experience: At least 7, preferably 10+ years
- Required Skills: Node.js, Python, Artificial Intelligence, Java, Go, Rust
Requirements
- BA/BS degree or equivalent experience
- At least 7 years (preferably 10+) of experience building and operating large-scale production distributed systems where latency, correctness, and reliability (99.99% uptime) are non-negotiable.
- Deep backend systems experience in one or more modern server environments (e.g., Java, Go, Rust, Python, Node.js), with the ability to ramp up and adapt quickly in new stacks.
- Expertise with distributed systems, concurrency, scaling, and debugging multi-layer systems.
- Strong operational judgment: you define SLIs/SLOs, build observability, and improve systems via incidents and feedback loops, not heroics.
- Staff behaviors: you lead multi-team initiatives, write decision-quality design docs, influence architecture beyond your immediate team, and communicate across the organization.
- Ability to make decisions with incomplete information, understand and communicate one-way vs. two-way doors (irreversible vs. reversible decisions), and move with urgency while keeping critical code operational.
- Stay curious and open to growth — actively building fluency in emerging technologies like AI to unlock creativity, accelerate progress, and amplify impact.
Responsibilities
- Collaborate with exceptional engineers on building systems and services for the world's largest companies.
- Lead architecture for distributed services at scale that synchronize shared state across clients, including clear correctness guarantees (e.g., ordering, idempotence, convergence).
- Define concurrency and conflict-resolution semantics for simultaneous changes, including their trade-offs and constraints.
- Design for failure: retries, partial outages, reconnection, and safe recovery paths, with explicit degradation behavior.
- Own operational excellence: define SLIs/SLOs, instrument tracing/metrics/logging, and drive reliability improvements through incident learning.
- Drive cross-team technical alignment via design docs and decision records; unblock execution across org boundaries.
- Raise the bar through design and code reviews, mentoring, and pragmatic standardization that increases leverage.
- Deliver maintainable, tested, performant systems and evolve them with a “crawl, walk, run” plan.
- Use modern tooling (including AI-assisted coding, debugging, and code review) to improve developer velocity and reduce time-to-diagnosis in production.
- Participate in engineering citizenship activities such as co-authoring engineering blogs, strengthening and improving our hiring processes, and leading internal hackathon teams.