Maroon Ayoub

Research Scientist & Architect, IBM Research

Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, performance efficiency, and open source integrations.

KV-Cache Centric Inference: Building an Open Source LLM Serving Platform Around State

We optimize LLM inference around compute - faster kernels, better batching, smarter parallelism. But in production, the real bottleneck is state. The KV-cache holds precomputed attention data that turns a multi-second prefill into a sub-second cache hit. Lose it to eviction, isolate it on one node, or route away from it, and you pay the full compute cost again for work you already did.

llm-d is an open-source distributed inference platform, co-founded by Google, IBM Research, Red Hat, NVIDIA, and CoreWeave, that treats the KV-cache as the core of the system rather than a byproduct. This enables tiered memory management (offloading KV blocks from GPU to CPU to shared storage), cross-replica reuse so cached state computed anywhere is usable everywhere, and cache-aware scheduling that routes requests to the replica most likely to hold their prefix.
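The cache-aware scheduling described above can be sketched as a scorer that counts how many leading blocks of a request's prompt each replica already caches. This is an illustrative sketch, not llm-d's actual scheduler: `BLOCK_SIZE`, `prefix_blocks`, and `pick_replica` are hypothetical names, and the chained hashing only mimics vLLM-style prefix-block identity.

```python
# Illustrative sketch (assumed names, not the llm-d API): route each
# request to the replica holding the longest contiguous cached prefix,
# breaking ties by lowest load.

BLOCK_SIZE = 16  # tokens per KV block; 16 is a common vLLM default

def prefix_blocks(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block of the prompt, chained so a block's identity
    depends on everything before it (prefix-cache semantics)."""
    hashes, h = [], 0
    full = len(token_ids) - len(token_ids) % block_size
    for i in range(0, full, block_size):
        h = hash((h, tuple(token_ids[i:i + block_size])))
        hashes.append(h)
    return hashes

def pick_replica(token_ids, replica_index, loads):
    """replica_index: replica -> set of cached block hashes.
    loads: replica -> current queue depth (lower is better)."""
    want = prefix_blocks(token_ids)
    def cached_prefix_len(replica):
        hits = 0
        for h in want:          # count only the contiguous cached prefix
            if h not in replica_index[replica]:
                break
            hits += 1
        return hits
    return max(replica_index, key=lambda r: (cached_prefix_len(r), -loads[r]))
```

On a cache miss everywhere, the tiebreaker degrades gracefully to least-loaded routing, which is why the load term belongs in the same key.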

This session walks through how llm-d and vLLM implement each layer of this stack, how they combine into a production system, and what the open‑source community can build on top. We’ll share benchmarks, Kubernetes deployment patterns, and practical guidance for operators running LLM workloads at scale.

Not All Tokens Are Equal: Semantic KV-Cache for Agentic LLM Serving

Agentic AI workloads - tree-of-thought exploration, ReAct loops, hierarchical swarms - expose a fundamental mismatch in how we serve PyTorch models. Today's inference stacks treat the KV-cache as a flat, anonymous tensor buffer with blind LRU eviction. This ignores the structural reality of agents: system prompts are durable, tool definitions are shared, and reasoning scratchpads are ephemeral. We are currently evicting high-value state to preserve throwaway tokens.

In this talk, we present Semantic KV-Cache, an architectural evolution for llm-d and vLLM that replaces anonymous blocks with Typed State.

We demonstrate a runtime that tags blocks as SystemPrompt, ToolDefinition, or ReasoningBranch, applying differentiated policies to each: pinning foundational context, replicating shared tools, and eagerly evicting completed thoughts. We show how this "lifecycle-aware" caching reduces recomputation and minimizes the "Agentic Tax" - evolving the PyTorch serving stack from request-centric to workload-aware.
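The differentiated policies above can be sketched as an eviction routine that pins durable context and ranks the rest by type, then recency. The type names follow the talk; the pinning rule, priority ordering, and function names are illustrative assumptions, not the actual llm-d/vLLM implementation.

```python
# Hypothetical sketch of "Typed State" eviction: each KV block carries a
# lifecycle type, so eviction prefers ephemeral reasoning state over
# shared tools, and never touches pinned system prompts.

# Lower priority = evicted first (assumed ordering for illustration).
PRIORITY = {"ReasoningBranch": 0, "ToolDefinition": 1, "SystemPrompt": 2}

def choose_victims(blocks, n):
    """blocks: list of (block_id, kind, last_used_tick).
    Returns the ids of n blocks to evict: lowest-priority kind first,
    LRU within a kind. SystemPrompt blocks are pinned and never chosen."""
    evictable = [b for b in blocks if b[1] != "SystemPrompt"]
    ranked = sorted(evictable, key=lambda b: (PRIORITY[b[1]], b[2]))
    return [block_id for block_id, _, _ in ranked[:n]]
```

Under blind LRU, a recently finished reasoning branch would outlive a long-idle system prompt; here the completed thought goes first regardless of recency.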

Disaggregated Tokenization: Building Toward Tokens-In-Tokens-Out LLM Inference

LLMs are token-in, token-out - but our serving stacks aren't. Tokenization and preprocessing are still locked inside the inference engine, blocking the cache-aware routing and encode/prefill/decode (E/P/D) disaggregation that production deployments demand. To route smart, you need tokens before you reach the backend - and with multi-modal inputs requiring heavy encode-stage preprocessing, this is an architectural imperative, not just an optimization.

In llm-d, we learned this the hard way: three tokenization approaches, three gaps. We're now converging on disaggregated tokenization via vLLM's Renderer API as a gRPC sidecar, and collaborating with the Gateway API Inference Extension community to define the tokens-in-tokens-out interface. For multi-modal workloads, disaggregating preprocessing unlocks independent scaling of encode, prefill, and decode - each with different compute profiles.
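The tokens-in/tokens-out split can be sketched as a gateway that calls a tokenizer sidecar before routing, so cache-aware decisions see token IDs rather than raw text. This is a minimal sketch under stated assumptions: `tokenize_sidecar` stands in for a real tokenizer service (e.g. the Renderer behind gRPC), and all names here are hypothetical.

```python
# Sketch of disaggregated tokenization: preprocessing happens outside
# the inference engine, and the backend receives token IDs, not text.

def tokenize_sidecar(text):
    # Stand-in for a real tokenizer service; character codes are used
    # here only so the sketch is self-contained.
    return [ord(c) for c in text]

def gateway_handle(text, route_fn, backends):
    token_ids = tokenize_sidecar(text)       # tokenize before routing
    backend = route_fn(token_ids, backends)  # routing can now see tokens
    return backend, token_ids                # engine gets tokens, not text
```

The point of the shape, not the toy tokenizer: once the gateway holds token IDs, prefix-aware routing and encode/prefill/decode placement become gateway-level decisions instead of engine internals.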

Join us to discuss: How should we standardize tokenization and multi-modal preprocessing outside the engine? How does this shape E/P/D disaggregation? What are your pain points? We'll frame the problem from scheduling, vLLM, and gateway perspectives - then open the floor.

KV-Cache Wins You Can Feel: Building AI-Aware LLM Routing on Kubernetes

Every LLM request carries invisible state: the KV-cache. Hit it, and your response is 10x cheaper and 50x faster. Miss it, and you're recomputing work you just did. Yet Kubernetes' default load balancing is cache-blind, scattering related requests across pods and destroying locality. The result? Your AI workloads are slower and vastly more expensive than they should be.

In this hands-on tutorial, we’ll fix that.

Attendees will deploy a distributed vLLM cluster, benchmark its performance, and visualize how cache-blind routing wastes GPU cycles. Then, we’ll replace the default Service with the Kubernetes Gateway API (Inference Extension) and deploy llm-d, a Kubernetes-native framework for distributed LLM inference with an AI-aware scheduler. By re-running the same benchmarks, you’ll see latency and throughput transform as prefix-reuse becomes first-class. You’ll leave with a working lab, dashboards, and a mental model for building cache-aware routing into any production AI stack.

Serving Transformers at Scale: KV-Cache-Aware Routing for PyTorch Inference

Large-scale language model inference is dominated not just by FLOPs but by cache. As transformer models scale in context length and session complexity, reusing the key-value attention cache (KV-cache) becomes critical for reducing latency and cost and improving throughput.

This talk presents a KV-Cache-centric approach to scalable PyTorch-based inference, showing how cache reuse and intelligent request routing dramatically improve performance.

We’ll explore:
- The impact of KV-Cache offloading and reuse on latency and resource efficiency in long-context inference
- Load balancing challenges under cache constraints, and routing strategies that maximize reuse
- How llm-d, an open-source project by Google, IBM Research, and Red Hat, enables KV-cache-aware routing and prefix matching to boost hit rates and GPU utilization
- Real-world benchmarks showing latency gains, compute savings, and scalability under high-concurrency AI workloads

We’ll walk through key system design choices, from memory locality and CPU offloading to coordinated routing for sessionful AI workloads - ideal for anyone serving LLMs in production or building scalable inference platforms.
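The CPU offloading mentioned above can be sketched as a two-tier block store: hot KV blocks stay in (simulated) GPU memory, overflow spills to CPU, and a lookup in the cold tier promotes the block back instead of forcing a full recompute. The class and its policy are illustrative assumptions, not llm-d's memory manager.

```python
# Illustrative two-tier KV block store: GPU-resident LRU with CPU
# overflow, so "eviction" becomes offload rather than loss.
from collections import OrderedDict

class TieredKV:
    def __init__(self, gpu_capacity):
        self.gpu = OrderedDict()   # block_id -> data, LRU order
        self.cpu = {}              # overflow tier (simulated host memory)
        self.cap = gpu_capacity

    def put(self, block_id, data):
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.cap:
            victim, v = self.gpu.popitem(last=False)  # offload, don't drop
            self.cpu[victim] = v

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)            # refresh recency
            return self.gpu[block_id]
        if block_id in self.cpu:                      # hit in the cold tier:
            self.put(block_id, self.cpu.pop(block_id))  # promote to GPU
            return self.gpu[block_id]
        return None                                   # true miss: full prefill
```

The economics follow from the three return paths: a GPU hit is free, a CPU hit costs a transfer, and only a true miss pays the prefill compute.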

A deep dive into how cache reuse, prefix-cache-aware scheduling, and intelligent routing can dramatically improve latency, throughput, and GPU utilization in PyTorch-based LLM serving - powered by lessons from llm-d, a production-grade open-source framework.

Routing Stateful AI Workloads in Kubernetes

Kubernetes excels at stateless service routing - but modern AI workloads are not stateless. Generative workloads demand context-aware routing that maximizes performance while reducing costs.

This talk explores layered routing strategies for stateful LLM workloads on Kubernetes - from round-robin to full KV-Cache-aware load balancing. We’ll explain when each level applies, and its effects on performance.

Based on our experience developing llm-d - a framework using the K8s Gateway API Inference Extension, a collaboration between Google, IBM Research, and Red Hat - we’ll cover:
- Why traditional Kubernetes routing falls short for generative AI
- Routing patterns for long-context, sessionful traffic
- Global cache indices and local offloading for smart routing
- Benchmarks showing latency, cache hit rates, and GPU utilization
- Practical ways to adopt cache-aware routing without major infra changes
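The layered strategies above can be sketched as three interchangeable routers, from stateless round-robin through session affinity to cache-aware scoring. All names and the scoring rule are illustrative assumptions for this sketch, not llm-d's routing code.

```python
# Illustrative ladder of routing levels for stateful LLM traffic.
import itertools
import hashlib

def make_round_robin(pods):
    # Level 0: stateless, cache-blind spreading.
    it = itertools.cycle(pods)
    return lambda req: next(it)

def make_session_affinity(pods):
    # Level 1: the same session always lands on the same pod, so its
    # KV-cache stays warm on one replica.
    def route(req):
        digest = hashlib.sha256(req["session_id"].encode()).digest()
        return pods[digest[0] % len(pods)]
    return route

def make_cache_aware(pods, cache_index):
    # Level 2: score pods by overlap with the request's prefix blocks.
    # cache_index: pod -> set of cached prefix hashes.
    def route(req):
        want = set(req["prefix_hashes"])
        return max(pods, key=lambda p: len(cache_index[p] & want))
    return route
```

Each level trades implementation effort for locality: affinity needs only a session key, while cache-aware routing needs a (local or global) index of what each pod holds.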

If you’re scaling multi-turn, agentic, or LLM-powered workloads, this session is for you.

This talk distills our experience building llm-d, focusing on progressive levels of context-aware routing for LLMs - from stateless to cache-aware. Attendees will leave with practical patterns for scaling generative AI serving on Kubernetes.

Serving PyTorch LLMs at Scale: Disaggregated Inference with Kubernetes and llm-d

As PyTorch-based LLMs scale in complexity and user concurrency, their inference demands diverge across stages. Prefill is compute-heavy; decode is latency-sensitive. In this talk, we introduce a disaggregated serving pattern for PyTorch LLMs using llm-d - a Kubernetes-native, open-source framework co-developed by IBM Research, Google, and Red Hat. We'll walk through how llm-d separates prefill and decode into orchestrated sidecars, improving GPU utilization and QoS alignment. You'll learn how the Gateway API Inference Extension (GIE) enables routing based on load, cache locality, and session affinity. The talk includes real-world benchmarks and a visual demo of llm-d serving PyTorch models with vLLM across heterogeneous hardware on Kubernetes.
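The prefill/decode split described above can be sketched as a dispatcher that assigns each stage to its own worker pool and hands the KV-cache between them by reference. Pool shapes, the `kv://` handle, and the least-loaded policy are illustrative assumptions, not llm-d's wire protocol.

```python
# Sketch of prefill/decode disaggregation: compute-heavy prefill and
# latency-sensitive decode run on separate pools, linked by a KV handle
# so decode never recomputes the prompt.

def dispatch(request, prefill_pool, decode_pool):
    # Stage 1: place the long-prompt prefill on the least-loaded
    # compute-optimized worker.
    prefill_worker = min(prefill_pool, key=lambda w: w["queue"])
    # The cache handoff: a reference to where the KV state will live
    # (assumed scheme, for illustration only).
    kv_handle = f"kv://{prefill_worker['name']}/{request['id']}"
    # Stage 2: place token-by-token decode on the least-loaded
    # latency-optimized worker, which pulls KV state via the handle.
    decode_worker = min(decode_pool, key=lambda w: w["queue"])
    return {"prefill": prefill_worker["name"],
            "decode": decode_worker["name"],
            "kv": kv_handle}
```

Because the two pools scale independently, a burst of long prompts grows the prefill pool without over-provisioning decode capacity, and vice versa.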

Open Source Summit + Embedded Linux Conference North America 2026 (upcoming)

May 2026 - Minneapolis, Minnesota, United States

PyTorch Conference Europe 2026 (upcoming)

April 2026 - Paris, France

KubeCon + CloudNativeCon Europe 2026

March 2026 - Amsterdam, The Netherlands

KubeCon + CloudNativeCon North America 2025

November 2025 - Atlanta, Georgia, United States

PyTorch Conference 2025

October 2025 - San Francisco, California, United States
