HPC Support Engineer - Named Accounts

Posted 3 months ago

USA, Full-Time, AI Cloud

Company: Lambda

Location: USA

Languages: English

Seniority level: Lead, 7+ years

Experience: 7+ years

Skills:

Python, Kubernetes, Grafana, Prometheus, Linux, DevOps, Terraform, Documentation, Problem Solving, Customer Service, Mentoring, Ansible, Troubleshooting

Requirements:

7+ years of experience in HPC or cloud support engineering, with customer-facing responsibilities.
Proven experience managing large-scale Linux clusters and distributed HPC/AI workloads.
Deep expertise in orchestration tools such as Kubernetes and/or Slurm.
Strong knowledge of GPU technologies (CUDA, NCCL, MIG, NVLink, GPUDirect RDMA).
Skilled in high-throughput networking (InfiniBand, RoCE) and cluster storage solutions.
Familiarity with monitoring/logging platforms (Prometheus, Grafana, Datadog).
Experience leading incident management and communicating directly with enterprise or hyperscale customers.
Ability to balance deep technical troubleshooting with clear, concise communication.

Responsibilities:

Act as the primary technical point of escalation for Super Intelligence customers running hyperscale GPU clusters.
Lead incident response for complex issues, ensuring rapid triage, clear communication, and timely resolution.
Proactively identify risks in large environments (firmware, performance bottlenecks, orchestration issues) and drive preventative improvements.
Partner closely with Lambda Engineering and Product teams to influence roadmap decisions.
Contribute to runbooks, best practices, and operational guides tailored for hyperscale environments.
Train and mentor other support engineers.
Participate in a rotating on-call schedule.