The Golden Triangle of Inference Optimization: Balancing Latency, Throughput, and Quality

Running high-performance production deployments for LLMs and other generative AI models requires balancing latency, throughput, and quality given constraints around hardware, model selection, and cost.

Assembling the right set of optimization techniques, from model-level optimizations like quantization and speculative sampling, to fast inference frameworks like vLLM, TensorRT-LLM, and SGLang, to methods like prefix caching and chunked prefill, is tricky, but it is essential for serving AI models effectively at scale.
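As a minimal sketch of how several of these techniques combine in practice, the snippet below enables quantization, prefix caching, and chunked prefill through vLLM's offline `LLM` entry point. The flag names (`quantization`, `enable_prefix_caching`, `enable_chunked_prefill`) follow vLLM's engine arguments and may vary across versions, and the model name is just an example AWQ-quantized checkpoint, not a recommendation.

```python
from vllm import LLM, SamplingParams

# Sketch only: flag names follow vLLM's engine arguments and may
# differ across versions; the checkpoint is an illustrative example.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # AWQ-quantized checkpoint
    quantization="awq",            # model-level: weight quantization
    enable_prefix_caching=True,    # reuse KV cache across shared prompt prefixes
    enable_chunked_prefill=True,   # interleave prefill with decode steps
)

prompts = ["Explain the tradeoff between latency and throughput in one sentence."]
params = SamplingParams(max_tokens=64, temperature=0.0)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

Each flag pulls on a different corner of the triangle: quantization cuts memory use and latency at a possible quality cost, prefix caching improves time to first token for requests that share a prompt prefix, and chunked prefill smooths inter-token latency under mixed workloads.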

This talk introduces a wide variety of proven optimization techniques and, more importantly, provides a clear framework for categorizing, understanding, and selecting techniques based on the objectives you're optimizing for.

Philip Kiely

Head of Developer Relations at Baseten
