Session
Routing Stateful AI Workloads in Kubernetes
Kubernetes excels at stateless service routing - but modern AI workloads are not stateless. Generative workloads demand context-aware routing that improves latency and GPU utilization while cutting serving costs.
This talk explores layered routing strategies for stateful LLM workloads on Kubernetes - from round-robin to full KV-cache-aware load balancing. We’ll explain when each level applies and what effect it has on performance.
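As a taste of that progression, here is a minimal sketch - invented names, not llm-d’s actual API - of what the cache-aware end of the ladder adds over round-robin: prefer the replica that already holds the longest prefix of the request in its KV cache, and fall back to least-loaded when no one does.

```go
// Illustrative sketch only (hypothetical types, not llm-d's API):
// a cache-aware endpoint picker at the top of a routing ladder
// whose bottom rung is plain round-robin.
package main

import "fmt"

// Endpoint is the router's view of one model-server replica.
type Endpoint struct {
	Name           string
	CachedPrefix   int // request-prefix tokens already in this replica's KV cache
	ActiveRequests int // current load
}

// pickEndpoint favors the longest cached prefix (skipping prefill
// work), breaking ties by lowest load; with no cache hits anywhere
// it degrades to least-loaded routing.
func pickEndpoint(eps []Endpoint) Endpoint {
	best := eps[0]
	for _, e := range eps[1:] {
		if e.CachedPrefix > best.CachedPrefix ||
			(e.CachedPrefix == best.CachedPrefix && e.ActiveRequests < best.ActiveRequests) {
			best = e
		}
	}
	return best
}

func main() {
	eps := []Endpoint{
		{"pod-a", 0, 2},
		{"pod-b", 512, 5}, // warm: 512 prompt tokens need no recompute
		{"pod-c", 0, 1},
	}
	fmt.Println("route to:", pickEndpoint(eps).Name) // pod-b
}
```

A real scorer would blend more signals (queue depth, prefill vs. decode load) - exactly the layering the talk walks through.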
Based on our experience developing llm-d - a framework built on the K8s Gateway API Inference Extension and a collaboration between Google, IBM Research, and Red Hat - we’ll cover:
- Why traditional Kubernetes routing falls short for generative AI
- Routing patterns for long-context, sessionful traffic
- Global cache indices and local offloading for smart routing (a minimal index sketch follows this list)
- Benchmarks showing latency, cache hit rates, and GPU utilization
- Practical ways to adopt cache-aware routing without major infra changes
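On the global cache index: here is a hypothetical sketch (all names invented for illustration) of the core idea - replicas report cached prompt prefixes at KV-block granularity, and the router looks up the longest indexed prefix of an incoming request to find a warm replica.

```go
// Hypothetical sketch of a global cache index (invented names):
// replicas report block-aligned prompt prefixes they have cached;
// the router queries for the longest indexed prefix of a request.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// blockKey hashes a prompt prefix cut at a block boundary, mirroring
// block-level KV caching so partial prefix matches remain findable.
func blockKey(prefix string) string {
	sum := sha256.Sum256([]byte(prefix))
	return hex.EncodeToString(sum[:8])
}

// Index maps a prefix-block key to replicas believed to cache it.
type Index map[string][]string

// Report registers every block-aligned prefix of prompt for replica.
func (ix Index) Report(replica, prompt string, blockSize int) {
	for end := blockSize; end <= len(prompt); end += blockSize {
		k := blockKey(prompt[:end])
		ix[k] = append(ix[k], replica)
	}
}

// Lookup returns replicas holding the longest indexed prefix of prompt.
func (ix Index) Lookup(prompt string, blockSize int) []string {
	for end := (len(prompt) / blockSize) * blockSize; end >= blockSize; end -= blockSize {
		if replicas, ok := ix[blockKey(prompt[:end])]; ok {
			return replicas
		}
	}
	return nil
}

func main() {
	ix := Index{}
	sharedPrefix := strings.Repeat("system prompt ", 8)
	ix.Report("pod-b", sharedPrefix, 16)
	// A new turn reusing the shared prefix routes to the warm replica.
	fmt.Println(ix.Lookup(sharedPrefix+"user: hi", 16)) // [pod-b]
}
```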
If you’re scaling multi-turn, agentic, or LLM-powered workloads, this session is for you.
This talk distills our experience building llm-d, focusing on progressive levels of context-aware routing for LLMs - from stateless to cache-aware. Attendees will leave with practical patterns for scaling generative AI serving on Kubernetes.