Defying GPU Scarcity: Strategies for Efficient Serving with Smaller GPUs

Let’s face it: we’re in a constant GPU shortage. Getting access to enough GPUs, especially the largest and latest, is a huge challenge. This talk explores strategies for maximizing the performance and efficiency of serving LLMs on multiple GPUs, particularly the older and smaller ones that are more readily available.

Sharding techniques partition a workload across several smaller GPUs, so that models too large for any single device can still be served. In addition, quantization reduces a workload’s memory usage by trading off a small amount of precision, a trade that is often acceptable in serving use cases.
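As a rough illustration of why these two techniques combine well, here is a minimal back-of-the-envelope sketch (not from the talk) that estimates how many GPUs are needed to hold a model’s weights at different quantization levels. The parameter count, GPU memory size, and overhead multiplier are all hypothetical example values, not recommendations.

```python
import math

def gpus_needed(num_params_b, bits_per_weight, gpu_mem_gib, overhead=1.2):
    """Estimate the GPU count needed to hold a model's weights.

    num_params_b:    parameter count in billions (hypothetical example).
    bits_per_weight: 16 for fp16/bf16, 8 for int8, 4 for int4 quantization.
    gpu_mem_gib:     memory per GPU in GiB.
    overhead:        rough multiplier for activations/KV cache (assumption).
    """
    weight_gib = num_params_b * 1e9 * (bits_per_weight / 8) / 2**30
    return math.ceil(weight_gib * overhead / gpu_mem_gib)

# A hypothetical 70B-parameter model served on 24 GiB GPUs:
fp16_gpus = gpus_needed(70, 16, 24)  # full-precision weights
int4_gpus = gpus_needed(70, 4, 24)   # 4-bit quantized weights
```

With these assumed numbers, quantizing from fp16 to int4 cuts the required GPU count severalfold, which is exactly the lever that makes serving on smaller, more available hardware practical.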

Join us to delve into the practical aspects of implementing these techniques, including the trade-offs involved and the price-to-performance gains achievable. By effectively utilizing multiple smaller GPUs, organizations can overcome resource availability limitations and harness the power of LLMs.

Mofi Rahman

Developer Relations Engineer, Google

New York City, New York, United States
