Staff AI/ML Infrastructure Engineer

Vultr
Cloud Infrastructure
Remote - United States
Full-Time
Staff
Salary
145,000 - 160,000 USD per year

Job Details

Experience
5+ years
Required Skills
Python, Bash, Machine Learning, Linux

Requirements

  • 5+ years of experience working with bare metal infrastructure and hardware automation
  • Hands-on experience with modern NVIDIA/AMD GPU platforms
  • Hands-on experience with high-performance networking (RoCE, InfiniBand)
  • Deep knowledge of BIOS, BMC, firmware, NICs, Redfish/IPMI, and PCIe systems
  • Strong Linux systems experience including device drivers and package management
  • Experience building infrastructure automation using Python and Bash
  • Familiarity with GPU drivers, firmware ecosystems, and vendor collaboration
  • Experience designing and delivering complex infrastructure products
  • Proven ability to lead projects and mentor engineers
  • Experience optimizing multi-cluster GPU environments
  • Exposure to Machine Learning software stacks and GPU workloads

Responsibilities

  • Design and maintain GPU and bare metal infrastructure in containerized and physical environments
  • Build scalable GPU clusters in partnership with networking and provisioning teams
  • Ensure reliable, high-performance provisioning of GPU infrastructure
  • Develop automated testing systems for GPU-based platforms
  • Implement infrastructure solutions for diverse AI/ML workloads
  • Benchmark, test, and troubleshoot GPU performance at scale
  • Collaborate with hardware vendors on drivers, firmware, and support
  • Resolve hardware, software, and performance issues across environments
  • Optimize rail and cluster performance across architectures
  • Lead technical direction and mentor engineers on infrastructure best practices