Speaker

Maroon Ayoub

AI Research Engineer, IBM Research

Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, caching strategies, and open source integrations.

Serving Transformers at Scale: KV-Cache-Aware Routing for PyTorch Inference

Large-scale language model inference is dominated not just by FLOPs but by cache. As transformer models scale in context length and session complexity, reusing the key-value attention cache (KV-Cache) becomes critical for reducing latency and cost and for improving throughput.

This talk presents a KV-Cache-centric approach to scalable PyTorch-based inference, showing how cache reuse and intelligent request routing dramatically improve performance.
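As a rough illustration of the core idea (not llm-d's implementation; the token values and helper function below are made up), a request that shares a prompt prefix with an already-cached sequence only needs fresh prefill for the tokens beyond that prefix:

```python
# Minimal sketch (illustrative only): why prefix reuse cuts prefill work.
# A new request that shares a prompt prefix with a cached sequence only
# needs attention computed for the tokens after the shared prefix.

def shared_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Length of the longest common token prefix between two sequences."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [101, 2009, 2003, 1037, 2307, 2154]      # tokens whose KV entries are cached
request = [101, 2009, 2003, 1037, 2307, 3185, 999]

reused = shared_prefix_len(cached, request)
to_prefill = len(request) - reused
print(f"reuse {reused} cached KV entries, prefill only {to_prefill} new tokens")
```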

We’ll explore:
- The impact of KV-Cache offloading and reuse on latency and resource efficiency in long-context inference
- Load balancing challenges under cache constraints, and routing strategies that maximize reuse
- How llm-d, an open-source project by Google, IBM Research, and Red Hat, enables KV-Cache-aware routing and prefix-matching to boost hit rates and GPU utilization
- Real-world benchmarks showing latency gains, compute savings, and scalability under high-concurrency AI workloads

We’ll walk through key system design choices, from memory locality and CPU offloading to coordinated routing for sessionful AI workloads - ideal for anyone serving LLMs in production or building scalable inference platforms.
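To make the routing idea concrete, here is a small hypothetical Python sketch of a cache-aware scorer. The block size, `Replica` structure, and hashing scheme are assumptions for illustration, not llm-d's actual API: each replica advertises which prompt-prefix blocks it currently holds, and the router favors the replica with the longest cached prefix, breaking ties by queue depth.

```python
# Hypothetical sketch of cache-aware request routing (illustrative only;
# llm-d's actual scorer and interfaces differ).

import hashlib
from dataclasses import dataclass, field

BLOCK = 16  # tokens per hashed prefix block (assumed granularity)

def block_hashes(tokens: list[int]) -> list[str]:
    """Rolling hashes of successive prefix blocks, so prefixes nest."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(bytes(str(tokens[i:i + BLOCK]), "utf-8"))
        hashes.append(h.hexdigest())
    return hashes

@dataclass
class Replica:
    name: str
    cached_blocks: set[str] = field(default_factory=set)
    queue_depth: int = 0

def route(request_tokens: list[int], replicas: list[Replica]) -> Replica:
    """Pick the replica with the most reusable prefix blocks, then least load."""
    req_hashes = block_hashes(request_tokens)

    def score(r: Replica) -> tuple[int, int]:
        hits = 0
        for h in req_hashes:          # prefixes nest, so stop at first miss
            if h not in r.cached_blocks:
                break
            hits += 1
        return (hits, -r.queue_depth)

    return max(replicas, key=score)
```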

A deep dive into how cache reuse, attention prefix-cache-aware scheduling, and intelligent routing can dramatically improve latency, throughput, and GPU utilization in PyTorch-based LLM serving - powered by lessons from llm-d, a production-grade open-source framework.

Routing Stateful AI Workloads in Kubernetes

Kubernetes excels at stateless service routing - but modern AI workloads are not stateless. Generative workloads demand context-aware routing that maximizes performance while reducing costs.

This talk explores layered routing strategies for stateful LLM workloads on Kubernetes - from round-robin to full KV-Cache-aware load balancing. We’ll explain when each level applies and how it affects performance.
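As a rough, non-authoritative sketch of those levels (class names and interfaces are invented for illustration, not taken from llm-d or the Gateway API Inference Extension), the progression looks something like this:

```python
# Illustrative routing "levels" for stateful LLM traffic (assumed names):
#   1. round-robin      - stateless, ignores context
#   2. session affinity - pin a session to the replica that last served it
#   3. cache-aware      - score replicas by expected KV-Cache reuse

import itertools

class RoundRobinRouter:
    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)
    def pick(self, session_id, prompt_tokens):
        return next(self._cycle)

class SessionAffinityRouter:
    def __init__(self, replicas):
        self._fallback = RoundRobinRouter(replicas)
        self._pinned = {}
    def pick(self, session_id, prompt_tokens):
        if session_id not in self._pinned:
            self._pinned[session_id] = self._fallback.pick(session_id, prompt_tokens)
        return self._pinned[session_id]

class CacheAwareRouter:
    def __init__(self, replicas, prefix_score):
        self.replicas = replicas
        self.prefix_score = prefix_score   # e.g. cached-prefix-block overlap
    def pick(self, session_id, prompt_tokens):
        return max(self.replicas,
                   key=lambda r: self.prefix_score(r, prompt_tokens))
```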

Based on our experience developing llm-d - a framework built on the Kubernetes Gateway API Inference Extension, developed in collaboration between Google, IBM Research, and Red Hat - we’ll cover:
- Why traditional Kubernetes routing falls short for generative AI
- Routing patterns for long-context, sessionful traffic
- Global cache indices and local offloading for smart routing
- Benchmarks showing latency, cache hit rates, and GPU utilization
- Practical ways to adopt cache-aware routing without major infra changes

If you’re scaling multi-turn, agentic, or LLM-powered workloads, this session is for you.

This talk distills our experience building llm-d, focusing on progressive levels of context-aware routing for LLMs - from stateless to cache-aware. Attendees will leave with practical patterns for scaling generative AI serving on Kubernetes.
