Che Yang
senior engineer
Yang Che is a senior engineer at Alibaba Cloud. He works on the Alibaba Cloud container service team, focusing on Kubernetes and container-related product development. Yang also works on building elastic machine learning platforms on those technologies. He is an active contributor to communities such as Kubernetes, Docker, and Kubeflow. Yang is a co-founder and maintainer of Fluid, a CNCF Sandbox project.
Mastering Prefill-Decode-Disaggregated Architecture: Solutions and Best Practices in Alibaba Cloud
Disaggregating the prefill and decoding phases of LLM inference has garnered significant attention in the industry because it can enhance performance. Several solutions have been developed, including Mooncake, TetriInfer, Splitwise, DistServe, and RTP-LLM. However, deploying disaggregated LLM inference at scale on Kubernetes, while also evaluating its performance and cost benefits, presents numerous challenges.
In this talk, we will introduce a solution that uses LeaderWorkerSet as the workload, together with an Ingress Controller and a node discovery service. It can deploy prefill-decode (PD) disaggregated inference on Kubernetes and supports multiple LLM inference engines, such as Mooncake and RTP-LLM, with zero intrusion. Furthermore, we will discuss how to improve load balancing with Envoy and ORCA, based on KVCache and metrics, and how to recommend optimal ratios between the prefill and decode phases. Finally, we will cover essential features for production deployment, such as high availability, elastic scaling, canary releases, and observability.
From Zero to Infinity: How AI-powered hedge fund build cloud-native AI Platform on Kubernetes
Metabit Trading is an AI-powered quantitative investment firm that builds its research platform on Kubernetes. However, its computing platform often faces sudden bursts of tasks that require scaling from 0 to 500 pods, all accessing data concurrently from distributed storage systems. Because these systems are rate-limited and slow, they significantly hamper training performance and limit compute scalability.
To tackle this, Metabit used Fluid, a CNCF project, and JuiceFS to build an elastic distributed cache solution. In this session, experts from Metabit and Fluid will discuss how to achieve automatic scaling in production environments by creating an elastic cache cluster and using Prometheus to set up a scaling strategy based on the behavioral characteristics of cache usage, ultimately reaching 1000 Gbps. They will also cover using Fluid with CronHPA for timed autoscaling to balance cost and performance, explain how to evaluate the performance and cost benefits of scaling, and present a demo showcasing the solution.
Fluid: Data Anyway, Data Anywhere, Data Anytime
Fluid is an open-source project for orchestrating data and workloads in Kubernetes. In the 2024 CNCF Technology Radar Report, Fluid is recognized as an "Adopted" project in the cloud-native AI landscape, considered ready for use by developers without further evaluation.
A maintainer from the Fluid community will explain why it is so popular, detailing its architecture and the "Data Anyway, Anywhere, Anytime" features. He will also showcase the dynamic data mounting capabilities that benefit data scientists and share insights into future feature plans.
Empower Large Language Models (LLMs) Serving in Production With Cloud Native AI Technologies
LLMs have heightened public expectations of generative models. However, as noted in the Gartner report, running AI applications in production poses significant challenges.
To tackle these challenges, we have redesigned and optimized the software capabilities of Cloud Native AI technologies. By extending KServe to handle OpenAI's streaming requests, it can accommodate LLM inference workloads. With Fluid and Vineyard, the loading time of the Llama-30B model is reduced from 10 minutes to under 25 seconds.
The optimizations do not stop there. Since LLM loading is not a high-frequency operation, it is crucial to use CronHPA for timed autoscaling to balance cost and performance, and to evaluate the cost-effectiveness of the scaling process.
As reviewers and maintainers of KServe and Fluid, we will share our insights on these challenges in this session. We will showcase effective use of Cloud Native AI and share our experiences in production.
Boundaryless Computing: Optimizing LLM Performance, Cost, and Efficiency in Multi-cloud Architecture
For large language model (LLM) inference, GPU resources within a single data center or cloud region often cannot meet all user demands. Additionally, deploying across multiple geographic regions is necessary to provide an optimal experience for end users. However, managing model distribution, synchronization, and consistency across multiple regions presents new challenges. To address this, the OCM and Fluid communities have collaborated to automate the multi-region distribution of inference applications by combining OCM's multi-cluster application deployment capabilities with Fluid's data orchestration capabilities. This automation facilitates the cross-regional distribution and pre-warming of large models, making model deployment and upgrades more efficient.