Senior Site Reliability Engineer - AI Infrastructure

Global Remote / San Francisco, CA | Full-Time | Senior

Salary not disclosed


Job Details

Required Skills: Python, Bash, Kubernetes, PyTorch, Go, Grafana, Prometheus, Linux, Terraform, Ansible, Helm

Requirements

Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent).
Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training.
Working knowledge of the distributed training and ML stack (NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar).
Expert-level Linux knowledge: kernel tuning, driver management (NVIDIA drivers, CUDA toolkit), cgroup/namespace internals, performance profiling.
Strong experience running Kubernetes in production with GPU workloads, including device plugins, topology-aware scheduling, multi-cluster federation, and custom operators, or equivalent experience with Slurm or other HPC schedulers.
Strong engineering skills in Python, Go, or Bash.
Infrastructure-as-Code proficiency (Terraform, Helm, Ansible, or equivalent).
Hands-on experience building monitoring and alerting for GPU infrastructure (DCGM, nvidia-smi, fabric manager metrics).
Proven track record leading incident response for complex distributed systems.

Responsibilities

Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
Serve as the primary technical point of contact for customers running large-scale training workloads.
Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure.
Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink).
Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health.
Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks.
