Decoding and Taming the Soaring Costs of Large Language Models

Running generative AI applications can be prohibitively expensive. This talk unravels the soaring costs of inference: providing real-time responses to user queries with large language models that have billions or trillions of parameters.

We estimate the staggering cost of serving individual user queries against these massive models, and delve into the performance and cost challenges, from GPU hardware accelerators to the latency of running ChatGPT.
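
To make the scale of these costs concrete, here is a back-of-envelope sketch of per-query serving cost. Every number (model size, GPU throughput, utilization, price per GPU-hour, response length) is an illustrative assumption, not a figure from the talk:

```python
# Back-of-envelope estimate of LLM inference cost per query.
# All numbers below are illustrative assumptions, not measurements.

PARAMS = 175e9                 # assumed model size (GPT-3 scale)
FLOPS_PER_TOKEN = 2 * PARAMS   # roughly 2 FLOPs per parameter per generated token
GPU_PEAK_FLOPS = 312e12        # NVIDIA A100 FP16 peak throughput
GPU_UTILIZATION = 0.30         # assumed effective utilization during decoding
GPU_COST_PER_HOUR = 3.00       # assumed cloud price per GPU-hour, in USD
RESPONSE_TOKENS = 500          # assumed length of a typical response

# (A 175B model would in practice be sharded across several GPUs;
# the per-token arithmetic scales roughly linearly with GPU count.)
tokens_per_sec = GPU_PEAK_FLOPS * GPU_UTILIZATION / FLOPS_PER_TOKEN
cost_per_token = GPU_COST_PER_HOUR / 3600 / tokens_per_sec

print(f"throughput: {tokens_per_sec:.0f} tokens/s per GPU")
print(f"cost per {RESPONSE_TOKENS}-token response: ${cost_per_token * RESPONSE_TOKENS:.4f}")
```

Under these assumptions a single response costs fractions of a cent, but multiplied by millions of daily queries, and by the many GPUs a frontier model actually requires, the aggregate bill grows quickly.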

We explore the potential to improve the resource efficiency and performance of large-scale AI applications running on Kubernetes and in cloud native environments through GPU resource sharing, advanced scheduling, and dynamic batching.
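
As one example of these techniques, the sketch below shows the core idea of dynamic batching: group incoming requests into a single model call, trading a small latency budget for much higher GPU throughput. The batch size, timeout, and `run_model_on_batch` function are hypothetical stand-ins, not the talk's implementation:

```python
# Minimal sketch of dynamic batching, assuming a hypothetical
# run_model_on_batch() stands in for the real inference call.
import queue
import threading
import time

MAX_BATCH_SIZE = 8        # cap on requests processed together (assumed)
MAX_WAIT_SECONDS = 0.05   # latency budget for filling a batch (assumed)

requests: "queue.Queue[str]" = queue.Queue()

def run_model_on_batch(batch):
    # Placeholder: a real server would run one forward pass over the batch.
    print(f"running a batch of {len(batch)} requests")

def batching_loop():
    while True:
        batch = [requests.get()]  # block until the first request arrives
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        run_model_on_batch(batch)

threading.Thread(target=batching_loop, daemon=True).start()
for i in range(20):
    requests.put(f"query-{i}")
    time.sleep(0.01)
time.sleep(0.2)  # give the loop time to drain the queue
```

The same amortization logic underlies production schedulers: because a GPU's cost is fixed per hour, every extra request folded into a batch directly lowers the cost per query.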

We hope this talk will spur further discussion and collaboration within the CNCF community around taming the costs of deploying and scaling generative AI using cloud native technologies and best practices.

Yuan Chen

Software Engineer at NVIDIA: Kubernetes, Scheduling, GPU, AI/ML, Resource Management

San Jose, California, United States
