Session

From Pull to Predict: Accelerating AI Model Deployment from minutes to seconds on Kubernetes

In the era of large AI models, deployment latency and resource utilization are significant challenges for Kubernetes operators. This session demonstrates techniques to reduce model startup times and optimize cluster resources. We'll deploy a 7B-parameter LLM using Ray and vLLM for scaling and serving, and implement three key optimizations: SOCI (Seekable OCI) for lazy loading of container images, which lets containers start without first downloading the entire image; an optimized storage layer that keeps models pre-downloaded and ready for quick access; and intelligent node provisioning with Karpenter for dynamic resource allocation. We'll then compare a standard deployment against one using these optimizations, showing the differences in startup times, resource usage, and operational costs. Attendees will learn the implementation steps for these techniques, which they can apply in their own Kubernetes environments to improve AI model deployment efficiency.

Tiago Reichert

Principal Specialist SA, AppMod

São Paulo, Brazil
