Machine Learning Engineer - Inference Optimization

Remote (worldwide) · Full-time · Mid-level
Salary not disclosed

Job Details

Required Skills
PyTorch

Requirements

  • Strong experience in ML inference optimization or high-performance ML systems
  • Solid understanding of deep learning internals (attention, memory layout, compute graphs)
  • Hands-on experience with PyTorch (or similar) and model deployment
  • Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations); see the kernel sketch after this list
  • Experience scaling inference for real users (not just research benchmarks)
  • Comfortable working in fast-moving startup environments, taking ownership and handling ambiguity
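To give a sense of the kernel-level GPU work referenced above, here is a minimal sketch of an elementwise Triton kernel, following the standard vector-add pattern from the Triton tutorials. The names (`add_kernel`, `add`) and the block size are illustrative placeholders, not anything specific to this role or employer.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds lanes in the last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Launch one program per BLOCK_SIZE chunk of the flattened tensor.
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```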

Responsibilities

  • Optimize inference latency, throughput, and cost for large-scale ML models in production
  • Profile GPU/CPU inference pipelines and identify bottlenecks (memory, kernels, batching, I/O)
  • Implement and tune inference techniques (a minimal quantization sketch follows this list) such as:
      - Quantization (fp16, bf16, int8, fp8)
      - KV-cache optimization and reuse
      - Speculative decoding, batching, and streaming
      - Model pruning or architectural simplifications for inference
  • Collaborate with research engineers to productionize new model architectures
  • Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)
  • Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups; see the latency-benchmark sketch below
  • Improve system reliability, observability, and cost efficiency under real workloads
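As a rough illustration of the precision-related techniques listed above, here is a minimal PyTorch sketch of bf16 autocast inference and int8 dynamic quantization. The toy model, shapes, and variable names are placeholders, not a description of this employer's stack.

```python
import copy
import torch
import torch.nn as nn

# Placeholder model standing in for a real production network.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).eval()
x = torch.randn(8, 4096)

# int8 dynamic quantization (CPU): weights stored as int8,
# activations quantized on the fly at inference time.
int8_model = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
with torch.inference_mode():
    y_int8 = int8_model(x)

# Mixed-precision (bf16) inference on GPU via autocast.
if torch.cuda.is_available():
    gpu_model = copy.deepcopy(model).cuda()
    with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        y_bf16 = gpu_model(x.cuda())
```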
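And a minimal sketch of the kind of latency measurement the benchmarking bullet refers to, assuming a CUDA device; `benchmark_latency` is a hypothetical helper, and the warm-up loop plus `torch.cuda.synchronize()` are there because GPU kernel launches are asynchronous.

```python
import time
import torch

def benchmark_latency(model, inputs, warmup=10, iters=100):
    """Return mean per-batch latency in milliseconds (hypothetical helper)."""
    model.eval()
    with torch.inference_mode():
        for _ in range(warmup):  # warm up caches, allocator, and any autotuners
            model(inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for queued kernels before starting the clock
        start = time.perf_counter()
        for _ in range(iters):
            model(inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters * 1e3
```

Throughput then follows as batch size divided by the measured per-batch latency.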