Staff AI/ML Infrastructure Engineer
Vultr
Cloud Infrastructure
Remote - United States
Full-Time
Staff
Salary: 145,000 - 160,000 USD per year
Job Details
- Experience: 5+ years
- Required Skills: Python, Bash, Machine Learning, Linux
Requirements
- 5+ years of experience working with bare metal infrastructure and hardware automation
- Hands-on experience with modern NVIDIA/AMD GPU platforms
- Hands-on experience with high-performance networking (RoCE, InfiniBand)
- Deep knowledge of BIOS, BMC, firmware, NICs, Redfish/IPMI, and PCIe systems
- Strong Linux systems experience including device drivers and package management
- Experience building infrastructure automation using Python and Bash
- Familiarity with GPU drivers, firmware ecosystems, and vendor collaboration
- Experience designing and delivering complex infrastructure products
- Proven ability to lead projects and mentor engineers
- Experience optimizing multi-cluster GPU environments
- Exposure to Machine Learning software stacks and GPU workloads
Responsibilities
- Design and maintain GPU and bare metal infrastructure in containerized and physical environments
- Build scalable GPU clusters in partnership with networking and provisioning teams
- Ensure reliable, high-performance provisioning of GPU infrastructure
- Develop automated testing systems for GPU-based platforms
- Implement infrastructure solutions for diverse AI/ML workloads
- Benchmark, test, and troubleshoot GPU performance at scale
- Collaborate with hardware vendors on drivers, firmware, and support
- Resolve hardware, software, and performance issues across environments
- Optimize rail and cluster performance across architectures
- Lead technical direction and mentor engineers on infrastructure best practices