Infrastructure Engineer (GPU & Compute)
Lightning AI | AI/ML
Fully remote within the U.S. | Full-Time | Senior
Salary: 180,000 - 200,000 USD per year
Job Details
- Experience: 5+ years in infrastructure engineering, systems engineering, or related roles
- Required Skills: Python, Linux
Requirements
- 5+ years of experience in infrastructure engineering, systems engineering, or related roles
- Strong Linux systems experience in production environments
- Hands-on experience with GPU-enabled systems and tools such as NVIDIA DCGM
- Familiarity with bare-metal provisioning and system bring-up workflows
- Proficiency in Python or similar scripting/programming languages for automation
- Ability to debug complex issues across hardware, OS, GPUs, and system software
- Experience with high-performance interconnects (e.g., InfiniBand, NVLink)
- Experience with PXE boot environments, LiveCD systems, or image-based provisioning workflows
- Experience with hardware management interfaces such as iDRAC, IPMI, or Redfish
- Data center operations experience, including working with physical hardware
- Experience supporting AI/ML or HPC workloads at scale
- Experience with GPU validation frameworks or large-scale hardware qualification processes
Responsibilities
- Own and evolve systems for image management, deployment, and validation across bare-metal infrastructure
- Run and maintain test clusters used for system validation, diagnostics, and bring-up
- Validate firmware, drivers, and OS images across compute and GPU-enabled systems
- Support hardware qualification efforts for next-generation platforms
- Own GPU diagnostics and validation workflows across large-scale infrastructure
- Diagnose and resolve complex issues across GPUs, drivers, OS, and hardware layers
- Analyze system and GPU performance using tools such as NVIDIA DCGM
- Identify failure patterns and drive improvements in system stability and validation coverage
- Build and maintain automation for provisioning, validation, and system bring-up
- Develop Python-based tools and workflows to improve efficiency and reduce manual operational overhead
- Improve the reliability, repeatability, and scalability of image pipelines and validation systems
- Manage and operate Linux-based systems in production and validation environments
- Manage virtualization technologies
- Support bare-metal provisioning workflows, including PXE and image-based systems
- Interface with hardware management systems (e.g., IPMI, Redfish) for monitoring and debugging
- Partner with Infrastructure, Hardware, and Data Center teams on system bring-up and validation
- Collaborate with platform and ML teams to ensure systems meet workload requirements
- Contribute to best practices for provisioning, diagnostics, and lifecycle management of infrastructure