Accelerating AI Inference at Scale: A Deep Dive Into NVIDIA Dynamo on Kubernetes

As foundation models move toward deeper test-time computation, inference becomes the dominant scaling constraint. Latency, throughput, and cost are governed by a small set of forces: autoregressive decoding, KV-cache growth, memory bandwidth, and scheduling under contention. This workshop frames large-scale inference through these emerging laws, starting from first principles and building toward real systems. Learners deploy NVIDIA Dynamo on Kubernetes to operate aggregated and disaggregated inference architectures using built-in KV-aware routing and scheduling. The outcome is a principled understanding of where inference time and money go, and how architectural choices bend those curves in production. Participants will deploy both aggregated and disaggregated inference on a 4xA100 node and compare the performance of the two architectures.
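The KV-cache growth named above can be made concrete with a quick back-of-the-envelope calculation. The sketch below uses standard 7B-class transformer dimensions for illustration; the function name and parameter choices are assumptions, not part of the workshop material:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, dtype_bytes=2):
    """Memory held by the KV cache: 2 tensors (K and V) per layer,
    each of shape [batch, kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# A 7B-class model (32 layers, 32 KV heads, head_dim 128, fp16)
# holds 0.5 MiB of cache per token per sequence:
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
print(per_token)  # 524288 bytes = 0.5 MiB

# At a 4096-token context and batch size 32, the cache alone is 64 GiB,
# which is why KV-aware routing and scheduling matter at scale.
total = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=32)
print(total / 2**30)  # 64.0 GiB
```

Because this cache grows linearly with both sequence length and batch size while GPU memory is fixed, the scheduler's placement and eviction decisions directly determine achievable throughput.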

Anshul Jindal

Sr. Solution Architect
