- Push throughput via continuous batching, speculative decoding, and kernel-level tuning.
- Cut latency by identifying and fixing compute, memory, or network bottlenecks.
- Optimize KV cache via paged attention, prefix caching, and quantization.
- Perform empirical quantization work across weights, activations, and KV.
- Shrink cold starts and memory footprint.
- Scale across nodes using distributed inference topologies.
- Set technical direction for benchmarking and architecture.
PythonKubernetesPyTorch+1 more