Infrastructure Engineer (GPU & Compute)

Fully remote within the U.S. | Full-Time | Senior
Salary: 180,000 - 200,000 USD per year

Job Details

Experience
5+ years of experience in infrastructure engineering, systems engineering, or related roles
Required Skills
Python, Linux

Requirements

  • 5+ years of experience in infrastructure engineering, systems engineering, or related roles
  • Strong Linux systems experience in production environments
  • Hands-on experience with GPU-enabled systems and tools such as NVIDIA DCGM
  • Familiarity with bare-metal provisioning and system bring-up workflows
  • Proficiency in Python or similar scripting/programming languages for automation
  • Ability to debug complex issues across hardware, OS, GPUs, and system software
  • Experience with high-performance interconnects (e.g., InfiniBand, NVLink)
  • Experience with PXE boot environments, LiveCD systems, or image-based provisioning workflows
  • Experience with hardware management interfaces such as iDRAC, IPMI, or Redfish
  • Data center operations experience, including working with physical hardware
  • Experience supporting AI/ML or HPC workloads at scale
  • Experience with GPU validation frameworks or large-scale hardware qualification processes

Responsibilities

  • Own and evolve systems for image management, deployment, and validation across bare-metal infrastructure
  • Run and maintain test clusters used for system validation, diagnostics, and bring-up
  • Validate firmware, drivers, and OS images across compute and GPU-enabled systems
  • Support hardware qualification efforts for next-generation platforms
  • Own GPU diagnostics and validation workflows across large-scale infrastructure
  • Diagnose and resolve complex issues across GPUs, drivers, OS, and hardware layers
  • Analyze system and GPU performance using tools such as NVIDIA DCGM
  • Identify failure patterns and drive improvements in system stability and validation coverage
  • Build and maintain automation for provisioning, validation, and system bring-up
  • Develop Python-based tools and workflows to improve efficiency and reduce manual operational overhead
  • Improve the reliability, repeatability, and scalability of image pipelines and validation systems
  • Manage and operate Linux-based systems in production and validation environments
  • Manage virtualization technology
  • Support bare-metal provisioning workflows, including PXE and image-based systems
  • Interface with hardware management systems (e.g., IPMI, Redfish) for monitoring and debugging
  • Partner with Infrastructure, Hardware, and Data Center teams on system bring-up and validation
  • Collaborate with platform and ML teams to ensure systems meet workload requirements
  • Contribute to best practices for provisioning, diagnostics, and lifecycle management of infrastructure