Optimizing LLM Inference for the Rest of Us
Not every organization operates with the hyperscale resources of Anthropic, Google, or OpenAI. For the majority of businesses integrating Large Language Models (LLMs) into their critical paths, the high cost and scarcity of GPU/TPU accelerators present a significant challenge. Striking a balance among performance, availability, scalability, and cost-efficiency is a must.
While Kubernetes is a ubiquitous runtime for modern workloads, deploying LLM inference effectively demands a specialized approach. This session dives deep into practical strategies for optimizing your Kubernetes clusters and LLM inference workloads to run efficiently and cost-effectively. We will explore:
- Container and Model Optimization
- Accelerator Management
- Data & Storage
- Network & Load Balancing
- Observability
Attendees will leave with practical techniques for maximizing the cost/performance of LLM inference in their AI-powered applications on Kubernetes.