Senior HPC Cluster Engineer

New
United KingdomFull-TimeSenior
Salary not disclosed
Apply NowOpens the employer's application page

Job Details

Experience
5+ years of professional experience in system-level software engineering, low-level programming, or infrastructure performance optimization; 3+ years of hands-on Linux systems experience
Required Skills
PythonKubernetesC++GoLinux

Requirements

  • 5+ years of professional experience in system-level software engineering, low-level programming, or infrastructure performance optimization.
  • 3+ years of hands-on Linux systems experience, including administration, troubleshooting, and performance tuning.
  • Strong understanding of server architecture, Linux kernel fundamentals, PCIe devices, NICs, and high-performance computing systems.
  • Strong programming expertise in languages such as C, C++, Go, Python, or similar performance-oriented technologies.
  • Experience working with GPU infrastructure, HPC environments, distributed systems, or performance-critical systems.
  • Familiarity with containerization, orchestration, or virtualization technologies including Kubernetes, QEMU, or KVM.
  • Strong troubleshooting, systems analysis, and problem-solving capabilities in complex technical environments.
  • Excellent collaboration and communication skills.

Responsibilities

  • Optimize the performance, scalability, and reliability of GPU clusters and InfiniBand networks within high-performance computing environments.
  • Analyze, troubleshoot, and resolve root-cause issues related to GPU infrastructure, networking performance, and distributed computing systems.
  • Integrate and support new hardware technologies, including GPU devices, within existing cloud and virtualization environments.
  • Configure, maintain, and improve GPU device orchestration, Kubernetes integrations, and virtualization stacks such as QEMU and KVM.
  • Design and enhance automation systems for proactive monitoring, fault detection, and issue remediation across HPC clusters.
  • Collaborate with engineering teams to improve infrastructure efficiency, system performance, and platform resilience.
  • Conduct performance analysis and optimization for HPC workloads, including AI/ML training, simulations, and large-scale data processing.
  • Contribute to infrastructure evolution by improving networking, virtualization, and distributed computing capabilities.
View Full Description & ApplyYou'll be redirected to the employer's site
View details
Apply Now