Senior Site Reliability Engineer, Robotics & Cloud Infrastructure

Based in the United States, Full-Time, Senior

Salary: Competitive base salary ranging from $164,000 to $220,000, depending on location and experience

Job Details

Experience: 5+ years
Required Skills: AWS, Docker, Python, Bash, Kubernetes, Go, Grafana, Prometheus, Linux, Terraform

Requirements

5+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles supporting production systems with on-call ownership.
Strong experience designing and operating scalable cloud infrastructure, preferably on AWS, including networking, compute, storage, and IAM.
Proficiency in infrastructure-as-code tools such as Terraform and strong automation skills using Python, Go, or Bash.
Experience with containerization and orchestration technologies such as Docker and Kubernetes or equivalent systems.
Strong understanding of Linux systems, networking fundamentals, and modern observability tooling (e.g., Prometheus, Grafana, or equivalents).
Experience operating in hybrid environments that include edge or embedded systems, intermittent connectivity, or physical hardware constraints.
Strong incident management mindset with experience improving operational reliability, reducing toil, and building scalable on-call practices.
Ability to write clear documentation, automate repetitive workflows, and design systems that reduce reliance on tribal knowledge.
Excellent communication skills and a strong ownership mentality in fast-moving, small-team environments.
Comfort working across robotics, cloud infrastructure, and distributed data systems.

Responsibilities

Own end-to-end system reliability across the full stack, including onboard robotics compute, operator systems, cloud infrastructure, and data delivery platforms.
Build and enhance infrastructure automation for provisioning, deployment, configuration management, and self-healing system behaviors across edge and cloud environments.
Design and scale observability systems (metrics, logging, tracing, alerting) to provide actionable insights across vehicle fleets and distributed cloud services.
Reduce operational overhead by eliminating single points of failure, automating manual workflows, and documenting runbooks for repeatable incident resolution.
Participate in a shared on-call rotation covering robotics and cloud incidents, while leading blameless postmortems and reliability improvements.
Define and track system reliability metrics such as uptime, data yield, and recovery time, aligned with continuous autonomous operations.
Manage and optimize AWS infrastructure across compute, storage, networking, security, and cost efficiency for large-scale data processing workloads.
Improve deployment safety, configuration management, and rollback strategies for fleet-wide updates across robotics systems.
Collaborate closely with robotics, data, and platform teams to embed reliability into system design from the ground up.
