Empower Large Language Models (LLMs) Serving in Production With Cloud Native AI Technologies

LLMs have heightened public expectations of generative models. However, as noted in the Gartner report, running AI applications in production poses significant challenges.
To tackle these challenges, we have redesigned and optimized the software capabilities of Cloud Native AI Technologies. Extending KServe to handle OpenAI's streaming requests lets it accommodate the inference load of LLMs. With Fluid and Vineyard, the loading time of the Llama-30B model drops from 10 minutes to under 25 seconds.
The optimizations do not stop there. Because LLM loading is not a high-frequency operation, it is crucial to use cronHPA for timed auto-scaling to balance cost and performance, and to evaluate the cost-effectiveness of the scaling process.
As reviewers and maintainers of KServe and Fluid, we will share our insights on these challenges in this session. We will also showcase effective use of Cloud Native AI and share our experiences in production.

Che Yang

senior engineer
