Lead Software Systems Engineer - GPU Performance

Nebius
Cloud Infrastructure AI
Remote - United States | Full-Time | Lead
Salary: $170,000 to $300,000 USD

Job Details

Experience
5+ years of professional experience in system-level software development; 3+ years of hands-on experience with Linux systems
Required Skills
Python, C++, Go, Linux

Requirements

  • 5+ years of professional experience in system-level software development focused on performance optimization and low-level programming.
  • 3+ years of hands-on experience with Linux systems including administration, troubleshooting, and performance tuning.
  • In-depth understanding of server architecture including PCIe devices and NICs.
  • Deep knowledge of the Linux OS and kernel.
  • Experience with high-performance computing (HPC) systems.
  • Strong proficiency in one or more performance-oriented programming languages: C/C++, Go, or Python.
  • Ability to work across the full stack including networking (InfiniBand/RoCE), virtualization (KVM/QEMU), and distributed communication layers (MPI, NCCL).

Responsibilities

  • Analyze and optimize the performance of large-scale GPU clusters at the intersection of hardware and software.
  • Investigate and troubleshoot GPU cluster performance issues under real training and inference workloads.
  • Evaluate and integrate new hardware, system configurations, and tuning approaches across the software stack.
  • Support complex performance-related escalations from internal teams and customers.
  • Collaborate with infrastructure, software engineering, and hardware vendor teams including NVIDIA, Mellanox, and Intel.
  • Contribute to hardware and cluster qualification and acceptance to ensure performance expectations are met.