- Optimize inference runtime performance for LLM and diffusion models across diverse hardware.
- Develop low-level kernels in CUDA or an equivalent (Triton, TileLang, Pallas) to speed up inference.
- Design and implement high-performance distributed systems for serving models at scale.
- Build operational infrastructure for cluster management, deployment automation, and production monitoring.
- Manage KV-cache memory and model serving within the vLLM stack.
- Collaborate asynchronously with global team members while maintaining code ownership.
Python, Kubernetes, PyTorch