Giving vLLM a Memory: Building a Stateful Agent Gateway
vLLM delivers exceptional throughput for inference, but it is fundamentally stateless: each request is handled in isolation, with no memory of prior responses, tool calls, or multi-step reasoning. That is fine for simple chat, but it limits production agentic workloads such as coding assistants and multi-step orchestration.
This talk introduces vllm-project/agentic-api, a lightweight gateway that sits in front of vLLM and implements the OpenAI Responses API (POST /v1/responses), adding stateful conversation management without modifying vLLM itself. The gateway can be enabled with a single flag.
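Because the gateway exposes the standard OpenAI Responses API, any compatible client should work against it. As a minimal sketch (the base URL, API key, and model name below are placeholders, not values from the project), a follow-up turn chains state through previous_response_id:

```python
# Sketch: calling the gateway with the OpenAI Python SDK.
# base_url, api_key, and the model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# First turn: the gateway stores the response and assigns it an ID.
first = client.responses.create(
    model="my-model",
    input="List the files in the repo, then summarize the README.",
)

# Second turn: previous_response_id lets the gateway rehydrate the
# conversation, so the model sees the full prior context.
follow_up = client.responses.create(
    model="my-model",
    input="Now explain the build steps in more detail.",
    previous_response_id=first.id,
)
print(follow_up.output_text)
```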
We’ll cover key design decisions, each sketched in the examples after this list:
Gateway layer: dual-mode (passthrough vs managed)
Orchestration engine: pydantic-ai for multi-turn execution and tool calls
Protocol translation: decoupling internal events from external APIs
State management: previous_response_id and conversation_id for rehydration, branching, and replay
Kubernetes readiness: multi-replica deployment, DB-backed consistency, GPU-free E2E testing
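Gateway layer: one way the dual-mode split could look, assuming a FastAPI front end; the routing rule, handler names, and upstream route are assumptions, not the project's actual code:

```python
# Sketch of a dual-mode gateway: requests carrying state fields take the
# managed path; everything else is proxied to vLLM unchanged. FastAPI,
# httpx, the routing rule, and all handler names here are assumptions.
from fastapi import FastAPI, Request
import httpx

app = FastAPI()
VLLM_URL = "http://vllm:8000"  # placeholder upstream address

@app.post("/v1/responses")
async def responses(request: Request) -> dict:
    body = await request.json()
    if body.get("previous_response_id") or body.get("conversation_id"):
        return await handle_managed(body)  # stateful orchestration path
    return await passthrough(body)         # plain proxy path

async def passthrough(body: dict) -> dict:
    # Forward to the upstream OpenAI-compatible server (route illustrative).
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{VLLM_URL}/v1/responses", json=body)
        return resp.json()

async def handle_managed(body: dict) -> dict:
    # Rehydrate stored state, run the orchestrator, persist the new
    # response; see the state-management sketch below.
    ...
```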
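Orchestration engine: a hedged sketch of multi-turn execution with a tool call in pydantic-ai. The model string, prompt, and tool are placeholders; in the gateway the model would point at vLLM's OpenAI-compatible endpoint:

```python
# Sketch: a pydantic-ai agent that loops over model turns and tool calls
# until the model produces a final answer. Model string, prompt, and the
# tool are placeholders, not taken from the project.
from pydantic_ai import Agent

agent = Agent(
    "openai:gpt-4o",  # in the gateway this would target vLLM's endpoint
    system_prompt="You are a coding assistant.",
)

@agent.tool_plain
def read_file(path: str) -> str:
    """Return the contents of a file in the workspace."""
    with open(path, encoding="utf-8") as f:
        return f.read()

result = agent.run_sync("Summarize what pyproject.toml configures.")
print(result.output)  # `.output` in recent pydantic-ai (`.data` in older)
```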
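Protocol translation: a sketch of mapping internal orchestration events onto the external wire format, so the orchestrator never depends on client-facing APIs. The internal event types are hypothetical; response.output_text.delta is a real Responses API stream event, though real frames carry more metadata than shown:

```python
# Sketch: internal orchestration events (hypothetical dataclasses) are
# mapped to OpenAI-style SSE frames at the edge. Payloads are trimmed;
# real Responses API frames carry more metadata.
from dataclasses import dataclass
import json

@dataclass
class TextDelta:        # internal event: a chunk of model output
    text: str

@dataclass
class ToolCallStarted:  # internal event: the model invoked a tool
    name: str

def to_sse(event: object) -> str:
    """Translate one internal event into a server-sent-event frame."""
    if isinstance(event, TextDelta):
        payload = {"type": "response.output_text.delta", "delta": event.text}
    elif isinstance(event, ToolCallStarted):
        payload = {"type": "response.output_item.added",
                   "item": {"type": "function_call", "name": event.name}}
    else:
        raise ValueError(f"unmapped internal event: {event!r}")
    return f"data: {json.dumps(payload)}\n\n"
```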
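State management: a minimal sketch of rehydration by walking a parent chain keyed by previous_response_id. An in-memory dict stands in for the DB-backed store, and all names are hypothetical; because each stored response records its parent, two children of one parent naturally form a branch, and replay is just rehydration from an older ID:

```python
# Sketch: chaining responses for rehydration, branching, and replay.
# All names here are hypothetical, not the project's actual schema.
import uuid

STORE: dict[str, dict] = {}  # stand-in for a database table

def save_response(messages: list[dict], parent_id: str | None) -> str:
    """Persist one turn's messages, linked to the turn before it."""
    response_id = f"resp_{uuid.uuid4().hex}"
    STORE[response_id] = {"messages": messages, "parent": parent_id}
    return response_id

def rehydrate(previous_response_id: str) -> list[dict]:
    """Rebuild the full message history by walking the parent chain."""
    history: list[dict] = []
    current: str | None = previous_response_id
    while current is not None:
        record = STORE[current]
        history = record["messages"] + history  # prepend older turns
        current = record["parent"]
    return history
```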
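Kubernetes readiness: for GPU-free E2E testing, one plausible approach (an assumption, not confirmed as the project's setup) is to override the agent's model with pydantic-ai's built-in TestModel, so the full request path runs in CI without a GPU:

```python
# Sketch: GPU-free end-to-end test by overriding the agent's model with
# pydantic-ai's TestModel, which fabricates valid responses.
# Illustrative only; not confirmed as the project's actual test setup.
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

agent = Agent("openai:gpt-4o")

def test_agent_loop_without_gpu() -> None:
    # `override` swaps the real model for the fake within the block.
    with agent.override(model=TestModel()):
        result = agent.run_sync("ping")
        assert result.output  # the fake model always returns something
```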
XingYan Jiang
DaoCloud, Software Engineer, Cloud Native Enthusiast