Speaker

Maroon Ayoub

AI Research Engineer, IBM Research

Maroon Ayoub is a systems engineer at IBM Research focused on distributed AI infrastructure. He co-leads development of llm-d and specializes in scaling LLM inference with Kubernetes-native architectures, caching strategies, and open source integrations.

Serving Transformers at Scale: KV-Cache-Aware Routing for PyTorch Inference

Large-scale language model inference is dominated not just by FLOPs but by cache. As transformer models scale in context length and session complexity, reusing the key-value attention cache (KV-Cache) becomes critical for reducing latency and cost and for improving throughput.

This talk presents a KV-Cache-centric approach to scalable PyTorch-based inference, showing how cache reuse and intelligent request routing dramatically improve performance.
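As a rough illustration of the core idea (not llm-d's implementation; the token values and helper function below are made up), a request that shares a prompt prefix with an already-cached sequence only needs fresh prefill for the tokens beyond that prefix:

```python
# Minimal sketch (illustrative only): why prefix reuse cuts prefill work.
# A new request that shares a prompt prefix with a cached sequence only
# needs attention computed for the tokens after the shared prefix.

def shared_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Length of the longest common token prefix between two sequences."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [101, 2009, 2003, 1037, 2307, 2154]      # tokens whose KV entries are cached
request = [101, 2009, 2003, 1037, 2307, 3185, 999]

reused = shared_prefix_len(cached, request)
to_prefill = len(request) - reused
print(f"reuse {reused} cached KV entries, prefill only {to_prefill} new tokens")
```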

We’ll explore:
- The impact of KV-Cache offloading and reuse on latency and resource efficiency in long-context inference
- Load balancing challenges under cache constraints, and routing strategies that maximize reuse
- How llm-d, an open-source project by Google, IBM Research, and Red Hat, enables KV-Cache-aware routing and prefix-matching to boost hit rates and GPU utilization
- Real-world benchmarks showing latency gains, compute savings, and scalability under high-concurrency AI workloads

We’ll walk through key system design choices, from memory locality and CPU offloading to coordinated routing for sessionful AI workloads - ideal for anyone serving LLMs in production or building scalable inference platforms.
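To make the routing idea concrete, here is a small hypothetical Python sketch of a cache-aware scorer. The block size, `Replica` structure, and hashing scheme are assumptions for illustration, not llm-d's actual API: each replica advertises which prompt-prefix blocks it currently holds, and the router favors the replica with the longest cached prefix, breaking ties by queue depth.

```python
# Hypothetical sketch of cache-aware request routing (illustrative only;
# llm-d's actual scorer and interfaces differ).

import hashlib
from dataclasses import dataclass, field

BLOCK = 16  # tokens per hashed prefix block (assumed granularity)

def block_hashes(tokens: list[int]) -> list[str]:
    """Rolling hashes of successive prefix blocks, so prefixes nest."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(bytes(str(tokens[i:i + BLOCK]), "utf-8"))
        hashes.append(h.hexdigest())
    return hashes

@dataclass
class Replica:
    name: str
    cached_blocks: set[str] = field(default_factory=set)
    queue_depth: int = 0

def route(request_tokens: list[int], replicas: list[Replica]) -> Replica:
    """Pick the replica with the most reusable prefix blocks, then least load."""
    req_hashes = block_hashes(request_tokens)

    def score(r: Replica) -> tuple[int, int]:
        hits = 0
        for h in req_hashes:          # prefixes nest, so stop at first miss
            if h not in r.cached_blocks:
                break
            hits += 1
        return (hits, -r.queue_depth)

    return max(replicas, key=score)
```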

A deep dive into how cache reuse, attention prefix-cache-aware scheduling, and intelligent routing can dramatically improve latency, throughput, and GPU utilization in PyTorch-based LLM serving - powered by lessons from llm-d, a production-grade open-source framework.

Routing Stateful AI Workloads in Kubernetes

Kubernetes excels at stateless service routing - but modern AI workloads are not stateless. Generative workloads demand context-aware routing that maximizes performance while reducing costs.

This talk explores layered routing strategies for stateful LLM workloads on Kubernetes - from round-robin to full KV-Cache-aware load balancing. We’ll explain when each level applies and how it affects performance.
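As a rough, non-authoritative sketch of those levels (class names and interfaces are invented for illustration, not taken from llm-d or the Gateway API Inference Extension), the progression looks something like this:

```python
# Illustrative routing "levels" for stateful LLM traffic (assumed names):
#   1. round-robin      - stateless, ignores context
#   2. session affinity - pin a session to the replica that last served it
#   3. cache-aware      - score replicas by expected KV-Cache reuse

import itertools

class RoundRobinRouter:
    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)
    def pick(self, session_id, prompt_tokens):
        return next(self._cycle)

class SessionAffinityRouter:
    def __init__(self, replicas):
        self._fallback = RoundRobinRouter(replicas)
        self._pinned = {}
    def pick(self, session_id, prompt_tokens):
        if session_id not in self._pinned:
            self._pinned[session_id] = self._fallback.pick(session_id, prompt_tokens)
        return self._pinned[session_id]

class CacheAwareRouter:
    def __init__(self, replicas, prefix_score):
        self.replicas = replicas
        self.prefix_score = prefix_score   # e.g. cached-prefix-block overlap
    def pick(self, session_id, prompt_tokens):
        return max(self.replicas,
                   key=lambda r: self.prefix_score(r, prompt_tokens))
```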

Based on our experience developing llm-d - a framework built on the Kubernetes Gateway API Inference Extension, developed in collaboration between Google, IBM Research, and Red Hat - we’ll cover:
- Why traditional Kubernetes routing falls short for generative AI
- Routing patterns for long-context, sessionful traffic
- Global cache indices and local offloading for smart routing
- Benchmarks showing latency, cache hit rates, and GPU utilization
- Practical ways to adopt cache-aware routing without major infra changes

If you’re scaling multi-turn, agentic, or LLM-powered workloads, this session is for you.

This talk distills our experience building llm-d, focusing on progressive levels of context-aware routing for LLMs - from stateless to cache-aware. Attendees will leave with practical patterns for scaling generative AI serving on Kubernetes.
