Senior AI Compute Infrastructure Engineer

Kraken
Cryptocurrency Infrastructure

Location: United Kingdom, Canada, Portugal, Spain, Poland, Ireland, Brazil, Romania, Czech Republic, Cyprus, Lithuania, Mexico, Peru, Costa Rica, Latvia, Slovenia, Panama, Estonia, Hungary, Bulgaria, Argentina, South Africa
Employment Type: Full-Time
Level: Senior

Salary not disclosed

Job Details

Experience: 5+ years
Required Skills: Python, Kubernetes, Linux, Networking, Distributed Systems

Requirements

5+ years of infrastructure engineering experience.
Hands-on experience operating GPU clusters or accelerator-backed infrastructure in production.
Strong systems engineering fundamentals across Linux, networking, storage, containers, and Kubernetes.
Experience with ML serving frameworks such as vLLM, Triton Inference Server, TensorRT, TorchServe, KServe, or Ray Serve.
Proficiency in Python for infrastructure automation and operational workflows.
Practical understanding of performance tradeoffs in GPU environments.
Track record of optimizing compute costs while maintaining reliability.
Experience building observable systems with metrics, logs, and dashboards.
Comfortable working in high-stakes, always-on environments.
Clear communicator of infrastructure tradeoffs to stakeholders.

Responsibilities

Own and operate GPU and accelerator clusters used for training, inference, evaluation, and experimentation.
Design infrastructure that enables Kraken teams to run models locally on GPUs.
Build and improve scheduling, orchestration, placement, quota management, and utilization systems.
Optimize inference pipelines for latency, throughput, reliability, memory efficiency, and cost.
Partner with ML engineers and researchers to remove bottlenecks in training, evaluation, and inference workflows.
Build observability for GPU utilization, memory pressure, queue depth, and token throughput.
Drive reliability, incident response, alerting, and post-incident improvements.
Evaluate and integrate new hardware, cloud instance families, and serving frameworks.
Build tooling to make GPU usage visible and easier for internal teams to consume.
