Session
No More GPU Cold Starts: Making Serverless ML Inference Truly Real-Time
Serverless ML inference is great, but when GPUs are involved, cold starts can turn milliseconds into minutes. Whether you're scaling transformer models or running custom inference services, the startup latency caused by container initialization, GPU driver loading, and heavyweight model deserialization can kill real-time performance and cost you a lot of money.
In this talk, we'll break down the anatomy of GPU cold starts in modern ML serving stacks: why GPUs introduce unique cold-path delays, how the CRI and device plugins contribute to them, and what really happens when a PyTorch model boots up on a fresh pod.
We’ll walk through production-ready strategies to reduce startup latency:
- Pre-warmed GPU pod pools to bypass init time
- Model snapshotting with TorchScript or ONNX to speed up deserialization
- Lazy loading techniques that delay model initialization until the first request
Together, these techniques help you eliminate cold-start pain and keep your services fast, efficient, and production-ready.
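To give a flavor of the snapshotting and lazy-loading strategies above, here is a minimal sketch in PyTorch. The model choice, file name, and request handler are illustrative assumptions rather than code from the talk; the idea is to serialize a TorchScript archive ahead of time, then defer loading it on the serving pod until the first request arrives.

```python
import threading

import torch
import torchvision.models as models

# --- Offline step: snapshot the model as a TorchScript archive ---
# Illustrative model and path; any traceable/scriptable model works.
model = models.resnet50(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(model, example_input)
torch.jit.save(scripted, "model_scripted.pt")

# --- Serving pod: lazy-load the snapshot on the first request ---
_model = None
_lock = threading.Lock()

def get_model():
    """Load the TorchScript snapshot once, on first use, so the pod can
    report ready before the heavy deserialization happens."""
    global _model
    if _model is None:
        with _lock:
            if _model is None:
                device = "cuda" if torch.cuda.is_available() else "cpu"
                _model = torch.jit.load("model_scripted.pt", map_location=device)
                _model.eval()
    return _model

def handle_request(batch: torch.Tensor) -> torch.Tensor:
    # Hypothetical request handler: the first call pays the load cost,
    # subsequent calls reuse the cached module.
    with torch.inference_mode():
        return get_model()(batch)
```

Loading the TorchScript archive skips re-importing and reconstructing the original Python model code, and deferring that load keeps pod startup fast at the cost of a slower first request.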