Session
Effortless Scalability: Orchestrating Large Language Model Inference with Kubernetes
Deploying and orchestrating large open-source inference models on Kubernetes has become a central challenge in AI/ML operations. This talk examines how to automate the deployment of heavyweight models such as Falcon and Llama 2, using Kubernetes Custom Resource Definitions (CRDs) to manage large model files packaged in container images. Each deployment exposes an HTTP server that serves inference calls through the model's runtime library.
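As a rough illustration, a client inside the cluster might call the model's HTTP inference endpoint like this. The service name, port, endpoint path, and payload shape below are assumptions for illustration only; the actual API depends on the serving stack the talk describes.

```python
# Minimal sketch of an inference call to an in-cluster model-serving HTTP endpoint.
# Service DNS name, path, and request schema are hypothetical.
import requests

ENDPOINT = "http://workspace-falcon-7b.default.svc.cluster.local:80/chat"  # assumed service/path

response = requests.post(
    ENDPOINT,
    json={"prompt": "What is Kubernetes?", "max_new_tokens": 128},  # assumed payload shape
    timeout=60,
)
response.raise_for_status()
print(response.json())
```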
This session explores how preset configurations remove the need to manually tune deployment parameters to fit the available GPU hardware, and how GPU nodes are auto-provisioned based on each model's requirements to ensure efficient resource utilization. Users can also deploy their own containerized models by supplying a pod template in the inference field of the workspace custom resource; the controller then creates deployment workloads that utilize all provisioned GPU nodes.
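A minimal sketch of what creating such a workspace custom resource could look like with the official Kubernetes Python client is shown below. The API group, version, and field names (resource, inference, preset) are illustrative assumptions rather than a definitive schema; only create_namespaced_custom_object and its parameters are real client calls.

```python
# Hedged sketch: submitting a hypothetical "Workspace" custom resource that
# requests GPU capacity and selects either a preset model or a user-provided
# pod template for inference. Field names and API group are assumptions.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a pod
api = client.CustomObjectsApi()

workspace = {
    "apiVersion": "example.ai/v1alpha1",  # assumed group/version
    "kind": "Workspace",
    "metadata": {"name": "falcon-7b-demo", "namespace": "default"},
    "resource": {                          # assumed: GPU nodes to auto-provision
        "instanceType": "Standard_NC12s_v3",
        "count": 1,
    },
    "inference": {                         # assumed: preset or custom pod template
        "preset": {"name": "falcon-7b"},
        # "template": {...}               # alternatively, a user-supplied pod template
    },
}

api.create_namespaced_custom_object(
    group="example.ai",
    version="v1alpha1",
    namespace="default",
    plural="workspaces",
    body=workspace,
)
```

Under this sketch, the controller watching the Workspace resource would provision matching GPU nodes and create the deployment workloads described above.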