Senior Site Reliability Engineer, Robotics & Cloud Infrastructure
Based in the United States, Full-Time, Senior
Salary: Competitive base salary ranging from $164,000 to $220,000, depending on location and experience
Job Details
- Experience: 5+ years
- Required Skills: AWS, Docker, Python, Bash, Kubernetes, Go, Grafana, Prometheus, Linux, Terraform
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles supporting production systems with on-call ownership.
- Strong experience designing and operating scalable cloud infrastructure, preferably on AWS, including networking, compute, storage, and IAM.
- Proficiency in infrastructure-as-code tools such as Terraform and strong automation skills using Python, Go, or Bash.
- Experience with containerization and orchestration technologies such as Docker and Kubernetes, or equivalent systems.
- Strong understanding of Linux systems, networking fundamentals, and modern observability tooling (e.g., Prometheus, Grafana, or equivalents).
- Experience operating in hybrid environments that include edge or embedded systems, intermittent connectivity, or physical hardware constraints.
- Strong incident-management mindset, with experience improving operational reliability, reducing toil, and building scalable on-call practices.
- Ability to write clear documentation, automate repetitive workflows, and design systems that reduce reliance on tribal knowledge.
- Excellent communication skills and strong ownership mentality in fast-moving, small-team environments.
- Comfort working across robotics, cloud infrastructure, and distributed data systems.
Responsibilities
- Own end-to-end system reliability across the full stack, including onboard robotics compute, operator systems, cloud infrastructure, and data delivery platforms.
- Build and enhance infrastructure automation for provisioning, deployment, configuration management, and self-healing system behaviors across edge and cloud environments.
- Design and scale observability systems (metrics, logging, tracing, alerting) to provide actionable insights across vehicle fleets and distributed cloud services.
- Reduce operational overhead by eliminating single points of failure, automating manual workflows, and documenting runbooks for repeatable incident resolution.
- Participate in a shared on-call rotation covering robotics and cloud incidents, while leading blameless postmortems and reliability improvements.
- Define and track system reliability metrics such as uptime, data yield, and recovery time, aligned with continuous autonomous operations.
- Manage and optimize AWS infrastructure across compute, storage, networking, security, and cost efficiency for large-scale data processing workloads.
- Improve deployment safety, configuration management, and rollback strategies for fleet-wide updates across robotics systems.
- Collaborate closely with robotics, data, and platform teams to embed reliability into system design from the ground up.