Senior Software Engineer II, Developer Experience, Operational Excellence

US/Canada Eastern Time Zone (ET), Full-Time, Senior

Salary: 154,700 - 208,000 USD per year

Job Details

Experience: 8+ years
Required Skills: AWS, Python, GCP, Go, Grafana, Terraform, Datadog

Requirements

8+ years of experience designing and building products in a software engineering team.
Bachelor's Degree in Computer Science/Engineering or equivalent practical experience.
3+ years of experience in infrastructure and/or platform engineering-focused teams.
Expertise in observability, reliability, operational metrics, and data analysis.
Proven track record in architecting monitoring frameworks, SLO platforms, and automated response workflows using Datadog (or equivalent observability tooling such as New Relic or Grafana).
Proven experience working on large-scale enterprise software applications.
Experience in Developer Experience (DevEx) & Internal Portals: Designing and implementing solutions/tools that centralize and simplify engineering operations.
Familiarity with cloud platforms (AWS, GCP, or the like).
Experience in implementing AI-driven automation across the software development lifecycle (SDLC) to reduce developer friction, automate repetitive technical tasks, and accelerate time-to-delivery.
Experienced at writing high-quality code (Go, Python, or equivalent) focused on infrastructure, deployment, and operations challenges.
Experience mentoring and supporting engineers and role modeling engineering practices in a technical lead capacity.
Proactive growth mindset, always looking at ways to improve the status quo.
Strong communication skills and a desire to collaborate across teams.
Experience with incident management tooling (Incident.io, PagerDuty, or equivalent).
Experience with Infrastructure as Code (IaC) using Terraform.

Responsibilities

Design and build automated reliability and self-healing systems that protect production at scale, including automated rollbacks, deploy safeguards, and fault mitigation, and deliver them as platform tooling that engineering teams across the company adopt for their own services.
Own and improve incident management tooling and on-call health. Reduce alert noise, surface actionable signals, and empower engineering teams to operate their services confidently with minimal operational burden.
Develop and evolve our observability infrastructure, including monitoring, alerting, SLOs, and performance regression detection, to give teams real-time, actionable visibility into system health and latency.
Contribute to AI-driven operational tooling that goes beyond triage, building toward autonomous remediation where AI detects issues, takes corrective action, and self-recovers with minimal human involvement.
Drive incident prevention by identifying systemic patterns and ruthlessly eliminating operational toil.
Partner directly with product engineering teams to diagnose reliability gaps, reduce their operational burden, and help them adopt best practices for running their services.
Define and champion operational excellence best practices across engineering through guardrails, scorecards, and standards that help teams run their services reliably by default.
Champion, role model, and embed Samsara’s cultural principles as we scale globally and across new offices.
