Tiny Models, Big Ideas: Quantization for Smarter Inference
With the rise of on-device intelligence, the push to run LLMs on edge hardware — phones, Raspberry Pis, even microcontrollers — is accelerating. At the heart of this revolution is quantization: the art of shrinking models without shrinking their intelligence.
This talk breaks down quantization, walking through its evolution from basic tricks to the smart, low-bit methods powering today's compact LLMs. We'll trace how post-training quantization (PTQ) and quantization-aware training (QAT) gave way to techniques like GPTQ, AWQ, and SmoothQuant, each balancing performance, accuracy, and deployability.
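To make the starting point concrete, here is a minimal sketch of the simplest form of PTQ: symmetric, per-tensor int8 rounding. The function names are illustrative, not from any particular library; GPTQ, AWQ, and SmoothQuant all refine this basic float-to-integer map.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor PTQ: map float weights to int8 with one scale."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())  # bounded by roughly scale / 2
```

The rounding error is what the smarter methods fight: GPTQ compensates for it layer by layer, while AWQ and SmoothQuant rescale weights and activations so the hardest-to-quantize values lose less precision.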
We’ll also dig into the growing toolbox of frameworks that are making it easier than ever to get these models running fast on real hardware — including vLLM, TensorRT-LLM, GGML, and MLC-LLM.
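As a taste of how little glue code these frameworks require, here is a hedged sketch using vLLM's offline API. The checkpoint name is just an example; any AWQ-quantized Hugging Face model should work, assuming a GPU with enough memory.

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint (example model id; swap in your own).
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```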
To wrap it up, we’ll look at real-world examples of quantized LLMs running on edge devices — and see what actually works, what breaks, and how far you can push performance without blowing up memory or latency. If you’re curious about how much model you can fit into a few megabytes — and still get useful completions — this talk is for you.
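As a back-of-envelope sanity check for that question, here is a tiny helper. The parameter counts are illustrative, and real deployments also pay for activations, the KV cache, and file-format overhead, so treat these as lower bounds.

```python
def weight_memory_mb(n_params: float, bits: int) -> float:
    """Approximate storage for the weights alone, in megabytes."""
    return n_params * bits / 8 / 1e6

for n_params, name in [(125e6, "125M"), (7e9, "7B")]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_mb(n_params, bits):,.0f} MB")
# A 125M-parameter model at 4-bit is ~62 MB; a 7B model at 4-bit is ~3.5 GB.
```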