Senior HPC Cluster Engineer
New
United KingdomFull-TimeSenior
Salary not disclosed
Apply NowOpens the employer's application page
Job Details
- Experience
- 5+ years of professional experience in system-level software engineering, low-level programming, or infrastructure performance optimization; 3+ years of hands-on Linux systems experience
- Required Skills
- PythonKubernetesC++GoLinux
Requirements
- 5+ years of professional experience in system-level software engineering, low-level programming, or infrastructure performance optimization.
- 3+ years of hands-on Linux systems experience, including administration, troubleshooting, and performance tuning.
- Strong understanding of server architecture, Linux kernel fundamentals, PCIe devices, NICs, and high-performance computing systems.
- Strong programming expertise in languages such as C, C++, Go, Python, or similar performance-oriented technologies.
- Experience working with GPU infrastructure, HPC environments, distributed systems, or performance-critical systems.
- Familiarity with containerization, orchestration, or virtualization technologies including Kubernetes, QEMU, or KVM.
- Strong troubleshooting, systems analysis, and problem-solving capabilities in complex technical environments.
- Excellent collaboration and communication skills.
Responsibilities
- Optimize the performance, scalability, and reliability of GPU clusters and InfiniBand networks within high-performance computing environments.
- Analyze, troubleshoot, and resolve root-cause issues related to GPU infrastructure, networking performance, and distributed computing systems.
- Integrate and support new hardware technologies, including GPU devices, within existing cloud and virtualization environments.
- Configure, maintain, and improve GPU device orchestration, Kubernetes integrations, and virtualization stacks such as QEMU and KVM.
- Design and enhance automation systems for proactive monitoring, fault detection, and issue remediation across HPC clusters.
- Collaborate with engineering teams to improve infrastructure efficiency, system performance, and platform resilience.
- Conduct performance analysis and optimization for HPC workloads, including AI/ML training, simulations, and large-scale data processing.
- Contribute to infrastructure evolution by improving networking, virtualization, and distributed computing capabilities.
View Full Description & ApplyYou'll be redirected to the employer's site