Not All Tokens Are Equal: Semantic KV-Cache for Agentic LLM Serving
Agentic AI workloads - tree-of-thought exploration, ReAct loops, hierarchical swarms - expose a fundamental mismatch in how we serve PyTorch models. Today's inference stacks treat the KV-cache as a flat, anonymous tensor buffer with blind LRU eviction. This ignores the structural reality of agents: system prompts are durable, tool definitions are shared, and reasoning scratchpads are ephemeral. We are currently evicting high-value state to preserve throwaway tokens.
In this talk, we present Semantic KV-Cache, an architectural evolution for llm-d and vLLM that replaces anonymous blocks with Typed State.
We demonstrate a runtime that tags blocks as SystemPrompt, ToolDefinition, or ReasoningBranch, applying differentiated policies to each: pinning foundational context, replicating shared tools, and eagerly evicting completed thoughts. We show how this "lifecycle-aware" caching reduces recomputation and minimizes the "Agentic Tax" - evolving the PyTorch serving stack from request-centric to workload-aware.
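The typed-block policy described above can be sketched roughly as follows. This is an illustrative toy, not the actual llm-d/vLLM implementation: the class names (`SemanticKVCache`, `KVBlock`), the `BlockType` enum values, and the eviction ordering are assumptions drawn only from the abstract's description of pinning system prompts, sharing tool definitions, and eagerly evicting completed reasoning branches.

```python
from dataclasses import dataclass
from enum import Enum, auto
import itertools

class BlockType(Enum):
    SYSTEM_PROMPT = auto()     # durable: pinned, never evicted
    TOOL_DEFINITION = auto()   # shared: evictable only when no agent references it
    REASONING_BRANCH = auto()  # ephemeral: evicted eagerly once the branch completes

@dataclass
class KVBlock:
    block_id: int
    kind: BlockType
    ref_count: int = 0         # live agents currently sharing this block
    completed: bool = False    # set when a reasoning branch finishes or is abandoned

class SemanticKVCache:
    """Toy lifecycle-aware cache: eviction order follows block type, not recency."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks: dict[int, KVBlock] = {}
        self._ids = itertools.count()

    def allocate(self, kind: BlockType) -> KVBlock:
        if len(self.blocks) >= self.capacity:
            self._evict_one()
        block = KVBlock(next(self._ids), kind)
        self.blocks[block.block_id] = block
        return block

    def _evict_one(self) -> None:
        # 1. Prefer completed reasoning branches (throwaway state).
        for b in self.blocks.values():
            if b.kind is BlockType.REASONING_BRANCH and b.completed:
                del self.blocks[b.block_id]
                return
        # 2. Fall back to tool definitions no agent still references.
        for b in self.blocks.values():
            if b.kind is BlockType.TOOL_DEFINITION and b.ref_count == 0:
                del self.blocks[b.block_id]
                return
        # System prompts are pinned; if nothing else is evictable, fail loudly.
        raise MemoryError("no evictable blocks: remaining state is pinned or live")
```

Under memory pressure this ordering evicts a finished scratchpad before it ever touches a shared tool definition, and never touches the system prompt, which is the inversion of what blind LRU does when the scratchpad happens to be the most recently used state.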