Senior Site Reliability Engineer - AI Infrastructure
Andromeda Cluster | AI Infrastructure
Global Remote / San Francisco, CA | Full-Time | Senior
Salary not disclosed
Job Details
Required Skills
- Python, Bash, Kubernetes, PyTorch, Go, Grafana, Prometheus, Linux, Terraform, Ansible, Helm
Requirements
- Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent).
- Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training.
- Working knowledge of distributed training libraries and ML frameworks (NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar).
- Expert-level Linux knowledge: kernel tuning, driver management (NVIDIA drivers, CUDA toolkit), cgroup/namespace internals, performance profiling.
- Strong experience running Kubernetes in production with GPU workloads, including device plugins, topology-aware scheduling, multi-cluster federation, and custom operators, or equivalent experience with Slurm or other HPC schedulers.
- Strong engineering skills in Python, Go, or Bash.
- Infrastructure-as-Code proficiency (Terraform, Helm, Ansible, or equivalent).
- Hands-on experience building monitoring and alerting for GPU infrastructure (DCGM, nvidia-smi, fabric manager metrics).
- Proven track record leading incident response for complex distributed systems.
Responsibilities
- Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
- Serve as the primary technical point of contact for customers running large-scale training workloads.
- Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure.
- Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink).
- Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health.
- Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
- Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks.