Model Serving Engineer
100% Remote (Continental United States), Full-Time, Senior
Salary: 100,000 - 150,000 USD per year
Job Details
- Experience: 6+ years
- Required Skills: Python, Kubernetes, C++, Go, Rust, Distributed Systems
Requirements
- Bachelor’s or Master’s degree in Computer Science or a related field.
- Six or more years of experience in distributed systems, infrastructure, or ML platform engineering.
- Strong proficiency in Python and a systems language such as Go, Rust, or C++.
- Deep experience operating high-throughput, low-latency services in production.
- Hands-on experience with LLM or large model inference frameworks such as vLLM or TensorRT-LLM.
- Strong understanding of GPU architecture, memory hierarchies, and accelerator utilization.
- Familiarity with Kubernetes, autoscaling, and modern cloud platforms.
- Experience with observability stacks including metrics, tracing, and structured logging.
- Solid grounding in performance engineering and capacity planning.
- Strong communication and incident response skills.
Responsibilities
- Design and operate model serving platforms supporting diverse workloads including LLMs, vision models, and recommendation systems.
- Optimize inference performance using continuous batching, paged attention, speculative decoding, and request multiplexing.
- Implement multi-tenant routing, rate limiting, and quality-of-service policies across model endpoints.
- Build autoscaling and capacity management systems that balance latency, throughput, and cost.
- Tune GPU utilization, memory management, and KV cache strategies for LLM serving workloads.
- Integrate model serving with API gateways, identity systems, and observability platforms.
- Drive end-to-end observability including latency histograms, queue dynamics, GPU utilization, and error tracking.
- Develop deployment workflows including canary releases, shadow testing, and automated rollback.