Senior AI Compute Infrastructure Engineer
Kraken
Cryptocurrency Infrastructure
United Kingdom, Canada, Portugal, Spain, Poland, Ireland, Brazil, Romania, Czech Republic, Cyprus, Lithuania, Mexico, Peru, Costa Rica, Latvia, Slovenia, Panama, Estonia, Hungary, Bulgaria, Argentina, South Africa
Full-Time, Senior
Salary not disclosed
Job Details
- Experience: 5+ years
- Required Skills: Python, Kubernetes, Linux, Networking, Distributed Systems
Requirements
- 5+ years of infrastructure engineering experience.
- Hands-on experience operating GPU clusters or accelerator-backed infrastructure in production.
- Strong systems engineering fundamentals across Linux, networking, storage, containers, and Kubernetes.
- Experience with ML serving frameworks such as vLLM, Triton Inference Server, TensorRT, TorchServe, KServe, or Ray Serve.
- Proficiency in Python for infrastructure automation and operational workflows.
- Practical understanding of performance tradeoffs in GPU environments.
- Track record of optimizing compute costs while maintaining reliability.
- Experience building observable systems with metrics, logs, and dashboards.
- Comfortable working in high-stakes, always-on environments.
- Able to communicate infrastructure tradeoffs clearly to stakeholders.
Responsibilities
- Own and operate GPU and accelerator clusters used for training, inference, evaluation, and experimentation.
- Design infrastructure that enables Kraken teams to run models locally on GPUs.
- Build and improve scheduling, orchestration, placement, quota management, and utilization systems.
- Optimize inference pipelines for latency, throughput, reliability, memory efficiency, and cost.
- Partner with ML engineers and researchers to remove bottlenecks in training, evaluation, and inference workflows.
- Build observability for GPU utilization, memory pressure, queue depth, and token throughput.
- Drive reliability, incident response, alerting, and post-incident improvements.
- Evaluate and integrate new hardware, cloud instance families, and serving frameworks.
- Build tooling to make GPU usage visible and easier for internal teams to consume.