Optimizing LLM Inference for the Rest of Us
Not every organization operates with the hyperscale resources of Anthropic, Google, or OpenAI. For the majority of businesses integrating Large Language Models (LLMs) into their critical paths, the high cost and scarcity of GPU/TPU accelerators present a significant challenge. Striking a balance among performance, availability, scalability, and cost-efficiency is a must.
While Kubernetes is a ubiquitous runtime for modern workloads, deploying LLM inference effectively demands a specialized approach. This session dives deep into practical strategies for optimizing your Kubernetes clusters and LLM inference workloads to run efficiently and cost-effectively. We will explore:
- Container and Model Optimization
- Accelerator Management
- Data & Storage
- Network & Load Balancing
- Observability
Attendees will leave with practical techniques for maximizing the cost/performance of LLM inference in their AI-powered applications on Kubernetes.