Lead Software Systems Engineer - GPU Performance
Nebius | Cloud Infrastructure, AI
Remote - United States | Full-Time | Lead
Salary: $170,000 to $300,000 USD
Job Details
- Experience: 5+ years of professional experience in system-level software development; 3+ years of hands-on experience with Linux systems
- Required Skills: Python, C++, Go, Linux
Requirements
- 5+ years of professional experience in system-level software development focused on performance optimization and low-level programming.
- 3+ years of hands-on experience with Linux systems including administration, troubleshooting, and performance tuning.
- In-depth understanding of server architecture including PCIe devices and NICs.
- Deep knowledge of Linux OS/Kernel.
- Experience with high-performance computing (HPC) systems.
- Strong proficiency in one or more performance-oriented programming languages: C/C++, Go, or Python.
- Ability to work across the full stack including networking (InfiniBand/RoCE), virtualization (KVM/QEMU), and distributed communication layers (MPI, NCCL).
Responsibilities
- Analyze and optimize the performance of large-scale GPU clusters at the intersection of hardware and software.
- Investigate and troubleshoot GPU cluster performance issues under real training and inference workloads.
- Evaluate and integrate new hardware, system configurations, and tuning approaches across the software stack.
- Support complex performance-related escalations from internal teams and customers.
- Collaborate with infrastructure, software engineering, and hardware vendor teams including NVIDIA, Mellanox, and Intel.
- Contribute to hardware and cluster qualification and acceptance to ensure performance expectations are met.