KV-Cache Centric Inference: Building an Open Source LLM Serving Platform Around State
We optimize LLM inference around compute: faster kernels, better batching, smarter parallelism. But in production, the real bottleneck is state. The KV‑cache holds precomputed attention keys and values that turn a multi‑second prefill into a sub‑second cache hit. Lose it to eviction, strand it on one node, or route away from it, and you pay the full compute cost again for work you already did.
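To make the "cache hit" concrete, here is a minimal sketch of block-level prefix caching, the idea behind this reuse. It is hypothetical illustration code, not the vLLM or llm-d API: prompts are split into fixed-size token blocks, each keyed by a hash of the full prefix ending at that block, so two requests sharing a system prompt share cached blocks.

```python
# Hypothetical sketch of block-level prefix caching (not the vLLM API).
# Each block is keyed by the hash of all tokens up to and including it,
# so any shared leading prefix hits the cache block by block.

BLOCK_SIZE = 4  # real systems use larger blocks, e.g. 16 tokens

def block_keys(tokens):
    """Yield one cache key per full block: hash of the prefix ending there."""
    for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
        yield hash(tuple(tokens[:end]))

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # key -> stand-in for the block's KV tensors

    def lookup(self, tokens):
        """Return how many leading tokens are already cached (skip their prefill)."""
        hit = 0
        for key in block_keys(tokens):
            if key not in self.blocks:
                break
            hit += BLOCK_SIZE
        return hit

    def insert(self, tokens):
        for key in block_keys(tokens):
            self.blocks.setdefault(key, "kv-block")  # placeholder KV data

cache = PrefixCache()
system_prompt = list(range(12))  # 12 shared "tokens"
cache.insert(system_prompt + [100, 101, 102, 103])
# A second request sharing the 12-token system prompt skips that prefill:
print(cache.lookup(system_prompt + [200, 201, 202, 203]))  # -> 12
```

Only the non-shared suffix still needs prefill compute; everything before the first cache miss is served from stored KV blocks.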
llm-d is an open-source distributed inference platform, co-founded by Google, IBM Research, Red Hat, NVIDIA, and CoreWeave, that treats the KV‑cache as the core of the system rather than a byproduct. That enables tiered memory management (offloading KV blocks from GPU to CPU to shared storage), cross‑replica reuse so cached state computed anywhere is usable everywhere, and cache‑aware scheduling that routes requests to the replica most likely to hold their prefix.
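The cache-aware scheduling idea can be sketched as a scorer over replicas. This is an illustrative toy, not llm-d's actual routing logic: each replica is scored by how many prompt blocks the router believes it holds, the longest expected prefix hit wins, and load breaks ties.

```python
# Hypothetical sketch of cache-aware routing (not llm-d's real scorer):
# score replicas by expected cached-prefix length, pick the best, and
# update the router's view of which blocks that replica now holds.

BLOCK_SIZE = 4

def block_keys(tokens):
    for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
        yield hash(tuple(tokens[:end]))

class Replica:
    def __init__(self, name):
        self.name = name
        self.cached = set()  # block keys this replica is believed to hold
        self.load = 0        # in-flight requests

def route(replicas, tokens):
    def prefix_hit(r):
        hit = 0
        for key in block_keys(tokens):
            if key not in r.cached:
                break
            hit += 1
        return hit
    # Longest cached prefix wins; ties broken by lowest current load.
    best = max(replicas, key=lambda r: (prefix_hit(r), -r.load))
    best.load += 1
    best.cached.update(block_keys(tokens))  # router's approximate cache view
    return best

a, b = Replica("a"), Replica("b")
shared = list(range(8))
route([a, b], shared + [1, 2, 3, 4])        # cold start: no hits anywhere
target = route([a, b], shared + [5, 6, 7, 8])
print(target.name)  # -> "a": it already holds the shared 8-token prefix
```

A real router works from approximate, possibly stale cache metadata rather than exact block sets, and has to balance prefix affinity against hot-spotting a single replica; the tuple key above is the simplest version of that trade-off.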
This session walks through how llm-d and vLLM implement each layer of this stack, how they combine into a production system, and what the open‑source community can build on top. We’ll share benchmarks, Kubernetes deployment patterns, and practical guidance for operators running LLM workloads at scale.