Large Scale Distributed LLM Inference with LLM-D and Kubernetes

Running Large Language Models (LLMs) locally for experimentation is easy, but running them in large-scale architectures is not. For businesses looking to integrate LLMs into their critical paths, the high costs and scarcity of GPU/TPU accelerators present a significant challenge. Striking a balance between performance, availability, scalability, and cost-efficiency is a must.

While Kubernetes is a ubiquitous runtime for modern workloads, deploying LLM inference effectively demands a specialized approach. Enter LLM-D, a cloud-native, Kubernetes-based, high-performance distributed LLM inference framework. Its architecture centers on a well-lit path for anyone looking to serve at scale, offering the fastest time-to-value and competitive performance per dollar for most models across a diverse and comprehensive set of hardware accelerators.
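
To picture the kind of workload this targets, the sketch below shows a GPU-backed model server deployed as a plain Kubernetes Deployment using the upstream vLLM OpenAI-compatible server image. It is illustrative only: the model name, replica count, and resource requests are assumptions, and it does not represent LLM-D's own manifests or charts, only the baseline setup such a framework builds on and improves.

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: llm-inference                        # hypothetical name
  spec:
    replicas: 2                                # assumed replica count
    selector:
      matchLabels:
        app: llm-inference
    template:
      metadata:
        labels:
          app: llm-inference
      spec:
        containers:
          - name: vllm
            image: vllm/vllm-openai:latest     # upstream vLLM server image
            args:
              - "--model"
              - "meta-llama/Llama-3.1-8B-Instruct"   # assumed model
            ports:
              - containerPort: 8000            # vLLM's default HTTP port
            resources:
              limits:
                nvidia.com/gpu: 1              # one accelerator per replica

Scaling this naive setup is where the hard problems start: scheduling scarce accelerators, routing requests across replicas, and keeping utilization high enough to justify the cost, which is the gap the session addresses.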

Abdel Sghiouar

Cloud Developer Advocate
