Senior AI Compute Infrastructure Engineer

Kraken
Cryptocurrency Infrastructure
United Kingdom, Canada, Portugal, Spain, Poland, Ireland, Brazil, Romania, Czech Republic, Cyprus, Lithuania, Mexico, Peru, Costa Rica, Latvia, Slovenia, Panama, Estonia, Hungary, Bulgaria, Argentina, South Africa
Full-Time
Senior
Salary not disclosed

Job Details

Experience
5+ years
Required Skills
Python, Kubernetes, Linux, Networking, Distributed Systems

Requirements

  • 5+ years of infrastructure engineering experience.
  • Hands-on experience operating GPU clusters or accelerator-backed infrastructure in production.
  • Strong systems engineering fundamentals across Linux, networking, storage, containers, and Kubernetes.
  • Experience with ML serving frameworks such as vLLM, Triton Inference Server, TensorRT, TorchServe, KServe, or Ray Serve.
  • Proficiency in Python for infrastructure automation and operational workflows.
  • Practical understanding of performance tradeoffs in GPU environments.
  • Track record of optimizing compute costs while maintaining reliability.
  • Experience building observable systems with metrics, logs, and dashboards.
  • Comfortable working in high-stakes, always-on environments.
  • Clear communicator of infrastructure tradeoffs to stakeholders.

Responsibilities

  • Own and operate GPU and accelerator clusters used for training, inference, evaluation, and experimentation.
  • Design infrastructure that enables Kraken teams to run models locally on GPUs.
  • Build and improve scheduling, orchestration, placement, quota management, and utilization systems.
  • Optimize inference pipelines for latency, throughput, reliability, memory efficiency, and cost.
  • Partner with ML engineers and researchers to remove bottlenecks in training, evaluation, and inference workflows.
  • Build observability for GPU utilization, memory pressure, queue depth, and token throughput.
  • Drive reliability, incident response, alerting, and post-incident improvements.
  • Evaluate and integrate new hardware, cloud instance families, and serving frameworks.
  • Build tooling to make GPU usage visible and easier for internal teams to consume.