Senior Systems Software Engineer, GPU Compute

Nebius | Cloud AI Infrastructure
Remote - United States | Full-Time | Senior
Salary: $170k-$300k + equity

Job Details

Experience
5+ years of professional experience in system-level software development
Required Skills
Python, C++, Go, Linux

Requirements

  • 5+ years of professional experience in system-level software development, with a focus on performance optimization and low-level programming.
  • 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning).
  • In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
  • Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python).
  • Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking.
  • Proven track record of analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads).
  • Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication.
  • Background in Software-Defined Networking (SDN) and experience with HPC cluster networking.
  • Understanding of QEMU/KVM virtualization and managing virtualized environments.
  • Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems.
  • Familiarity with collective communication libraries like MPI and NCCL for distributed computing.

Responsibilities

  • Tune the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments.
  • Analyze and troubleshoot the root cause of issues related to GPUs and InfiniBand networks, and propose corrective actions.
  • Integrate new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM.
  • Enhance automation systems that proactively monitor, detect, and resolve issues in GPU and InfiniBand environments.
  • Configure and manage GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation.