Session
From Model to Production: Efficient LLM Deployment with Kubernetes
Unlike deploying a simple web application on Kubernetes, serving AI/ML workloads involves a series of complex decisions to ensure they run efficiently and reliably in production. These decisions include understanding the model architecture, selecting the right hardware, choosing the appropriate inference backend or server, implementing effective auto-scaling, benchmarking the model's performance, and optimizing for scalable and reliable deployments.
In this session, we’ll walk you through the key steps, from having a trained or fine-tuned model to running it on Kubernetes in a scalable, cost-efficient, and high-performance manner. You’ll leave with a clear understanding of the practical architectural choices required to build and operate production-grade inference workloads on Kubernetes.
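As a minimal, illustrative sketch of where that journey starts (assuming a vLLM-style OpenAI-compatible serving image, a hypothetical model identifier, and a single NVIDIA GPU per replica), the snippet below uses the official Kubernetes Python client to create a basic GPU-backed inference Deployment; the session covers the architectural decisions that surround and refine a setup like this.

# Illustrative sketch only: create a single-replica, GPU-backed inference
# Deployment with the official Kubernetes Python client. The image, model
# name, and resource sizes are assumptions, not recommendations.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a cluster

labels = {"app": "llm-inference"}

container = client.V1Container(
    name="llm-server",
    image="vllm/vllm-openai:latest",             # assumed vLLM-style serving image
    args=["--model", "example-org/example-7b"],  # hypothetical model identifier
    ports=[client.V1ContainerPort(container_port=8000)],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1", "memory": "32Gi"},  # one GPU per replica
        requests={"cpu": "4", "memory": "32Gi"},
    ),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="llm-inference"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

# Submit the Deployment to the target namespace.
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)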
Audience Takeaways:
1. Understand the core architectural decisions that impact how AI/ML workloads run on Kubernetes in production.
2. Gain practical guidance for deploying AI/ML workloads in a scalable, efficient, and optimized way.
3. Learn how the Kubernetes ecosystem and its specific features simplify running production-grade inference at scale.

Tsahi Duek
Amazon Web Services, Principal Specialist Solutions Architect, Containers
City of London, United Kingdom