Senior Site Reliability Engineer - AI Infrastructure

Andromeda Cluster | AI Infrastructure
Global Remote / San Francisco, CA | Full-Time | Senior
Salary not disclosed

Job Details

Required Skills
Python, Bash, Kubernetes, PyTorch, Go, Grafana, Prometheus, Linux, Terraform, Ansible, Helm

Requirements

  • Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent).
  • Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training.
  • Working knowledge of distributed training & ML frameworks (NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar).
  • Expert-level Linux knowledge: kernel tuning, driver management (NVIDIA drivers, CUDA toolkit), cgroup/namespace internals, performance profiling.
  • Strong experience running GPU workloads on Kubernetes in production (device plugins, topology-aware scheduling, multi-cluster federation, custom operators), or equivalent experience with Slurm or other HPC schedulers.
  • Strong engineering skills in Python, Go, or Bash.
  • Infrastructure-as-Code proficiency (Terraform, Helm, Ansible, or equivalent).
  • Hands-on experience building monitoring and alerting for GPU infrastructure (DCGM, nvidia-smi, fabric manager metrics).
  • Proven track record leading incident response for complex distributed systems.

Responsibilities

  • Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
  • Serve as the primary technical point of contact for customers running large-scale training workloads.
  • Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure.
  • Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink).
  • Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health.
  • Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
  • Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks.