Senior HPC Cluster Engineer

United Kingdom | Full-Time | Senior

Salary not disclosed

Job Details

Experience: 5+ years of professional experience in system-level software engineering, low-level programming, or infrastructure performance optimization; 3+ years of hands-on Linux systems experience
Required Skills: Python, Kubernetes, C++, Go, Linux

Requirements

5+ years of professional experience in system-level software engineering, low-level programming, or infrastructure performance optimization.
3+ years of hands-on Linux systems experience, including administration, troubleshooting, and performance tuning.
Strong understanding of server architecture, Linux kernel fundamentals, PCIe devices, NICs, and high-performance computing systems.
Strong programming expertise in languages such as C, C++, Go, Python, or similar performance-oriented technologies.
Experience working with GPU infrastructure, HPC environments, distributed systems, or performance-critical systems.
Familiarity with containerization, orchestration, or virtualization technologies including Kubernetes, QEMU, or KVM.
Strong troubleshooting, systems analysis, and problem-solving capabilities in complex technical environments.
Excellent collaboration and communication skills.

Responsibilities

Optimize the performance, scalability, and reliability of GPU clusters and InfiniBand networks within high-performance computing environments.
Analyze, troubleshoot, and resolve root-cause issues related to GPU infrastructure, networking performance, and distributed computing systems.
Integrate and support new hardware technologies, including GPU devices, within existing cloud and virtualization environments.
Configure, maintain, and improve GPU device orchestration, Kubernetes integrations, and virtualization stacks such as QEMU and KVM.
Design and enhance automation systems for proactive monitoring, fault detection, and issue remediation across HPC clusters.
Collaborate with engineering teams to improve infrastructure efficiency, system performance, and platform resilience.
Conduct performance analysis and optimization for HPC workloads, including AI/ML training, simulations, and large-scale data processing.
Contribute to infrastructure evolution by improving networking, virtualization, and distributed computing capabilities.
