Senior Site Reliability Engineer, Robotics & Cloud Infrastructure

Based in the United States | Full-Time | Senior
Salary
Competitive base salary ranging from $164,000 to $220,000, depending on location and experience

Job Details

Experience
5+ years
Required Skills
AWS, Docker, Python, Bash, Kubernetes, Go, Grafana, Prometheus, Linux, Terraform

Requirements

  • 5+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles supporting production systems with on-call ownership.
  • Strong experience designing and operating scalable cloud infrastructure, preferably on AWS, including networking, compute, storage, and IAM.
  • Proficiency in infrastructure-as-code tools such as Terraform and strong automation skills using Python, Go, or Bash.
  • Experience with containerization and orchestration technologies such as Docker and Kubernetes or equivalent systems.
  • Strong understanding of Linux systems, networking fundamentals, and modern observability tooling (e.g., Prometheus, Grafana, or equivalents).
  • Experience operating in hybrid environments that include edge or embedded systems, intermittent connectivity, or physical hardware constraints.
  • Strong incident management mindset with experience improving operational reliability, reducing toil, and building scalable on-call practices.
  • Ability to write clear documentation, automate repetitive workflows, and design systems that reduce reliance on tribal knowledge.
  • Excellent communication skills and a strong ownership mentality in fast-moving, small-team environments.
  • Comfort working across robotics, cloud infrastructure, and distributed data systems.

Responsibilities

  • Own end-to-end system reliability across the full stack, including onboard robotics compute, operator systems, cloud infrastructure, and data delivery platforms.
  • Build and enhance infrastructure automation for provisioning, deployment, configuration management, and self-healing system behaviors across edge and cloud environments.
  • Design and scale observability systems (metrics, logging, tracing, alerting) to provide actionable insights across vehicle fleets and distributed cloud services.
  • Reduce operational overhead by eliminating single points of failure, automating manual workflows, and documenting runbooks for repeatable incident resolution.
  • Participate in a shared on-call rotation covering robotics and cloud incidents, and lead blameless postmortems and reliability improvements.
  • Define and track system reliability metrics such as uptime, data yield, and recovery time, aligned with continuous autonomous operations.
  • Manage and optimize AWS infrastructure across compute, storage, networking, security, and cost efficiency for large-scale data processing workloads.
  • Improve deployment safety, configuration management, and rollback strategies for fleet-wide updates across robotics systems.
  • Collaborate closely with robotics, data, and platform teams to embed reliability into system design from the ground up.