Serving Transformers at Scale: KV-Cache-Aware Routing for PyTorch Inference
Large-scale language model inference is dominated not only by FLOPs but also by memory. As transformer models scale in context length and session complexity, reusing the key-value attention cache (KV-Cache) becomes critical for reducing latency and cost and for improving throughput.
This talk presents a KV-Cache-centric approach to scalable PyTorch-based inference, showing how cache reuse and intelligent request routing dramatically improve performance.
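To make the mechanism concrete, here is a minimal sketch of incremental decoding with a KV-Cache in plain PyTorch. The names and shapes are illustrative assumptions, not llm-d code: each step computes keys and values only for the newest token and appends them to the cache, so attention runs over cached tensors instead of re-encoding the whole prefix.

```python
import torch
import torch.nn.functional as F

def decode_step(x_new, w_q, w_k, w_v, k_cache, v_cache):
    """One autoregressive step. x_new is the (batch, 1, d_model) hidden state
    of the newest token; k_cache/v_cache hold all previously computed keys/values."""
    q = x_new @ w_q                                      # query for the new token only
    k_cache = torch.cat([k_cache, x_new @ w_k], dim=1)   # append new key to the cache
    v_cache = torch.cat([v_cache, x_new @ w_v], dim=1)   # append new value to the cache
    # Attend over the full cached sequence without recomputing earlier K/V.
    out = F.scaled_dot_product_attention(q, k_cache, v_cache)
    return out, k_cache, v_cache

# Toy usage: generate a few tokens while the cache grows step by step.
B, d = 2, 64
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
k_cache = torch.empty(B, 0, d)
v_cache = torch.empty(B, 0, d)
for _ in range(4):
    x_new = torch.randn(B, 1, d)                         # stand-in for the new token's hidden state
    out, k_cache, v_cache = decode_step(x_new, w_q, w_k, w_v, k_cache, v_cache)
```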
We’ll explore:
- The impact of KV-Cache offloading and reuse on latency and resource efficiency in long-context inference
- Load balancing challenges under cache constraints, and routing strategies that maximize reuse
- How llm-d, an open-source project by Google, IBM Research, and Red Hat, enables KV-Cache-aware routing and prefix matching to boost hit rates and GPU utilization (see the routing sketch after this list)
- Real-world benchmarks showing latency gains, compute savings, and scalability under high-concurrency AI workloads
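As a rough illustration of prefix-cache-aware routing, the sketch below scores each replica by how many leading token blocks of the request it already holds in its KV-Cache and breaks ties by load. This is a hypothetical scheduler written for this description, with an assumed block size and data layout; it is not the llm-d API.

```python
import hashlib

BLOCK = 16  # assumed number of tokens per cache block

def block_hashes(token_ids):
    """Rolling hashes of the token-ID prefix, one per full block."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode("utf-8"))
        hashes.append(h.hexdigest())          # hash covers all blocks up to i
    return hashes

def pick_replica(token_ids, replicas):
    """replicas: dict name -> {"blocks": set of cached block hashes, "load": int}."""
    prefix = block_hashes(token_ids)

    def score(name):
        cached = replicas[name]["blocks"]
        hits = 0
        for bh in prefix:                     # count only the contiguous matched prefix
            if bh not in cached:
                break
            hits += 1
        return (hits, -replicas[name]["load"])  # prefer longest prefix, then least load

    return max(replicas, key=score)
```

The key design choice this illustrates is that routing on prefix hits keeps repeated or sessionful requests landing on replicas that can skip prefill for the shared prefix, while the load tiebreaker prevents hot replicas from absorbing all matching traffic.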
We’ll walk through key system design choices, from memory locality and CPU offloading to coordinated routing for sessionful AI workloads. The talk is ideal for anyone serving LLMs in production or building scalable inference platforms.
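For the CPU offloading piece, a common pattern (shown here as an assumed design sketch, not llm-d internals) is to move cold KV-Cache blocks into pinned host memory and copy them back asynchronously when a session with a matching prefix returns, trading PCIe bandwidth for GPU memory headroom:

```python
import torch

def offload_block(gpu_block: torch.Tensor) -> torch.Tensor:
    """Copy a KV-Cache block from GPU to pinned (page-locked) host memory."""
    cpu_block = torch.empty(gpu_block.shape, dtype=gpu_block.dtype,
                            device="cpu", pin_memory=True)
    cpu_block.copy_(gpu_block, non_blocking=True)   # async copy off the GPU
    return cpu_block

def restore_block(cpu_block: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Bring an offloaded block back onto the GPU when its prefix is reused."""
    return cpu_block.to(device, non_blocking=True)  # pinned memory enables async H2D copy
```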
A deep dive into how cache reuse, prefix-cache-aware scheduling, and intelligent routing can dramatically improve latency, throughput, and GPU utilization in PyTorch-based LLM serving, drawing on lessons from llm-d, a production-grade open-source framework.