
KV-Cache Wins You Can Feel: Building AI-Aware LLM Routing on Kubernetes

Every LLM request carries invisible state: the KV-cache. Hit it, and your response is 10x cheaper and 50x faster. Miss it, and you’re recomputing work you just did. Yet Kubernetes’ default load balancing is cache-blind, scattering related requests across pods and destroying locality. The result? Your AI workloads are slower and vastly more expensive than they should be.

In this hands-on tutorial, we’ll fix that.

Attendees will deploy a distributed vLLM cluster, benchmark its performance, and visualize how cache-blind routing wastes GPU cycles. Then we’ll replace the default Service with the Kubernetes Gateway API (Inference Extension) and deploy llm-d, a Kubernetes-native framework for distributed LLM inference with an AI-aware scheduler. Re-running the same benchmarks, you’ll watch latency drop and throughput climb as prefix reuse becomes a first-class scheduling signal. You’ll leave with a working lab, dashboards, and a mental model for building cache-aware routing into any production AI stack.
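To make the "mental model" concrete before the session, here is a minimal, self-contained sketch of the prefix-affinity idea: pin requests that share a prompt prefix to the pod that last served that prefix, so its warm KV-cache is reused instead of recomputed. This is an illustration only, not llm-d's actual scheduler or the Gateway API Inference Extension; the `Pod` and `PrefixAffinityRouter` names, the character-based prefix key, and the least-loaded fallback are all assumptions for the sake of the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Pod:
    """Hypothetical model of an inference pod: a name plus an in-flight request counter."""
    name: str
    in_flight: int = 0


class PrefixAffinityRouter:
    """Toy cache-aware router: requests sharing a prompt prefix are pinned to the pod
    that most recently served that prefix, so the shared prefix's KV-cache entries can
    be reused instead of recomputed from scratch."""

    def __init__(self, pods, prefix_chars=256):
        self.pods = {p.name: p for p in pods}
        self.prefix_chars = prefix_chars   # how much of the prompt counts as the shared prefix
        self.prefix_to_pod = {}            # prefix hash -> name of the pod that served it last

    def _prefix_key(self, prompt: str) -> str:
        # A real scheduler would key on token-block boundaries; characters keep the sketch simple.
        return hashlib.sha256(prompt[: self.prefix_chars].encode()).hexdigest()

    def route(self, prompt: str) -> Pod:
        key = self._prefix_key(prompt)
        pod_name = self.prefix_to_pod.get(key)
        if pod_name in self.pods:
            pod = self.pods[pod_name]      # affinity hit: route to the pod with a warm KV-cache
        else:
            # Affinity miss: fall back to the least-loaded pod and remember the mapping.
            pod = min(self.pods.values(), key=lambda p: p.in_flight)
            self.prefix_to_pod[key] = pod.name
        pod.in_flight += 1
        return pod

    def complete(self, pod: Pod) -> None:
        pod.in_flight = max(0, pod.in_flight - 1)


# Usage: two requests sharing a long system prompt land on the same pod.
router = PrefixAffinityRouter([Pod("vllm-0"), Pod("vllm-1"), Pod("vllm-2")])
system = "You are a helpful assistant. " * 10
print(router.route(system + "Summarize this document.").name)
print(router.route(system + "Translate this document.").name)  # same pod -> prefix KV-cache reuse
```

A cache-blind Service would have hashed these two requests independently and likely split them across pods, forcing each pod to prefill the shared system prompt from scratch; the affinity map is what turns the shared prefix into a routing signal.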

Maroon Ayoub

Research Scientist & Architect, IBM Research
