
Kante Yin
HivergeAI, Founding Engineer
London, United Kingdom
Kante is the founding engineer at HivergeAI; his work these days focuses mostly on AI agents and LLM inference. He also works on upstream Kubernetes as a maintainer of several projects, and he is the founder of InftyAI, an open-source community committed to building AI infrastructure.
Sailing Multi-Host Inference for LLMs on Kubernetes
Inference workloads are becoming increasingly prevalent and vital in the cloud-native world. Serving them is not easy, though: one of the biggest challenges is that large foundation models such as Llama 3.1 405B or DeepSeek R1 cannot fit on a single node. This calls for distributed inference with model parallelism, which in turn makes serving inference workloads even more complicated.
LeaderWorkerSet (LWS) is a dedicated multi-host inference project that aims to solve this problem, developed under the guidance of Kubernetes SIG Apps and the Serving Working Group. It offers features such as a dual template for the different types of Pods, fine-grained rolling update strategies, topology management, and all-or-nothing failure handling.
What's more, vLLM, an inference engine renowned for its performance and ease of use, has gained widespread popularity. In this presentation, we'll show you how to use LWS to deploy distributed inference with vLLM on Kubernetes.
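As a hedged illustration of the dual-template idea, the sketch below creates a LeaderWorkerSet custom resource for a vLLM deployment with the Kubernetes Python client; the image, group size, GPU counts, and resource names are placeholder assumptions, not values from the talk.

# Minimal sketch: create a LeaderWorkerSet (group leaderworkerset.x-k8s.io/v1)
# that runs vLLM across multiple hosts. Image names, sizes, and GPU counts
# below are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()

lws = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "vllm-demo", "namespace": "default"},
    "spec": {
        "replicas": 1,  # number of leader/worker groups
        "leaderWorkerTemplate": {
            "size": 2,  # Pods per group: 1 leader + 1 worker (placeholder)
            "leaderTemplate": {  # dual template: the leader Pod
                "spec": {
                    "containers": [{
                        "name": "vllm-leader",
                        "image": "vllm/vllm-openai:latest",  # placeholder image
                        "resources": {"limits": {"nvidia.com/gpu": "8"}},
                    }]
                }
            },
            "workerTemplate": {  # dual template: the worker Pods
                "spec": {
                    "containers": [{
                        "name": "vllm-worker",
                        "image": "vllm/vllm-openai:latest",  # placeholder image
                        "resources": {"limits": {"nvidia.com/gpu": "8"}},
                    }]
                }
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="leaderworkerset.x-k8s.io",
    version="v1",
    namespace="default",
    plural="leaderworkersets",
    body=lws,
)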
Building a Large Model Inference Platform for Heterogeneous Chinese Chips Based on vLLM
With the growing demand for heterogeneous computing power, Chinese users are gradually adopting domestic GPUs, especially for inference. vLLM, the most popular open-source inference project, has drawn widespread attention but does not support domestic chips, and Chinese inference engines are still maturing in functionality, performance, and ecosystem. In this session, we'll introduce how to adapt vLLM to support domestic GPUs, enabling acceleration features such as PagedAttention, continuous batching, and chunked prefill. We'll also cover performance bottleneck analysis and chip operator development to maximize hardware potential.
Additionally, Kubernetes has become the standard for container orchestration and is the preferred platform for inference services. We'll show how to deploy the adapted vLLM engine on Kubernetes with a few lines of code using the open-source llmaz project, and explore how llmaz handles heterogeneous GPU scheduling, along with our practices for monitoring and elastic scaling.
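As a rough, hedged sketch of what heterogeneous GPU scheduling looks like at the plain Kubernetes level (llmaz's own abstractions are not reproduced here), the example below pins an inference Pod to a particular accelerator type through a node selector and an extended resource request; the label key, resource name, and image are hypothetical placeholders.

# Minimal sketch of heterogeneous accelerator scheduling in plain Kubernetes
# terms: select nodes with a particular chip type via a node label, and request
# the chip through its extended resource name. The label key, resource name,
# and image are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="vllm-domestic-gpu"),
    spec=client.V1PodSpec(
        node_selector={"accelerator.example.com/type": "domestic-npu"},  # hypothetical label
        containers=[
            client.V1Container(
                name="vllm",
                image="example.com/vllm-adapted:latest",  # hypothetical adapted image
                resources=client.V1ResourceRequirements(
                    limits={"example.com/npu": "1"},  # hypothetical extended resource
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)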
Panel: Fragmentation of Scheduling in Kubernetes and Challenges for AI/ML Workloads
The scheduler is one of the most frequently customized components in Kubernetes, owing to its extensibility. However, the sheer number of schedulers leads to decision paralysis among users, which has been discussed extensively at past KubeCons. To help clear up this confusion, four maintainers from different communities (Godel-Scheduler, Koordinator, Kubernetes SIG Scheduling, and Volcano) are invited to present the background and use cases behind these projects.
The panel will also discuss the gap between upstream Kubernetes and downstream projects and try to identify the common patterns or functionality that could be pushed upstream to avoid reinventing the wheel, as well as what should remain loosely defined to preserve extensibility.
Moreover, with the rise of AI, scheduling AI workloads in Kubernetes poses a significant challenge. The panel will discuss where we are right now and where we are headed, as well as opportunities for cooperation.
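For context on how multiple schedulers coexist in one cluster, here is a minimal, hedged sketch: a Pod opts into a non-default scheduler by setting spec.schedulerName; the scheduler name and image below are hypothetical placeholders, not tied to any of the projects above.

# Minimal sketch: a Pod selects a non-default scheduler via spec.schedulerName.
# "my-custom-scheduler" and the image are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="scheduled-by-custom-scheduler"),
    spec=client.V1PodSpec(
        scheduler_name="my-custom-scheduler",  # hypothetical scheduler name
        containers=[client.V1Container(name="app", image="nginx")],  # placeholder
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)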
SIG-Scheduling Intro & Deep Dive
Kube-scheduler is a critical component of Kubernetes, responsible for placing each Pod onto the most suitable node. But how does it work, can we customize it for advanced use cases, and what are the best practices in large clusters? To answer these progressively deeper questions, we'll divide this session into two parts. If you're new to kube-scheduler, you may be interested in the Intro part; if you're an experienced user, you can join our Deep Dive.
What's more, we'll share some ongoing work within the SIG, including the latest progress on its sub-projects.
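As one concrete example of the customization discussed in the Deep Dive, kube-scheduler behaviour can be tuned through a KubeSchedulerConfiguration (kubescheduler.config.k8s.io/v1) with per-profile plugin sets. The sketch below builds such a config as a Python dict and dumps it to YAML; the profile name and the particular plugin choices are illustrative assumptions, not recommendations from the session.

# Minimal sketch: a KubeSchedulerConfiguration with one custom profile that
# favours bin packing. The profile name and plugin tuning are illustrative only.
import yaml

scheduler_config = {
    "apiVersion": "kubescheduler.config.k8s.io/v1",
    "kind": "KubeSchedulerConfiguration",
    "profiles": [
        {
            "schedulerName": "bin-packing-scheduler",  # hypothetical profile name
            "plugins": {
                "score": {
                    # Prefer packing Pods tightly instead of spreading them:
                    "disabled": [{"name": "NodeResourcesBalancedAllocation"}],
                },
            },
            "pluginConfig": [
                {
                    "name": "NodeResourcesFit",
                    "args": {
                        "scoringStrategy": {
                            "type": "MostAllocated",
                            "resources": [
                                {"name": "cpu", "weight": 1},
                                {"name": "memory", "weight": 1},
                            ],
                        }
                    },
                }
            ],
        }
    ],
}

# Print the YAML that would be passed to kube-scheduler via --config.
print(yaml.safe_dump(scheduler_config, sort_keys=False))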
Sailing Ray Workloads with KubeRay and Kueue in Kubernetes
Compute demands for machine learning are growing rapidly. Ray, a unified computing framework, allows ML engineers to scale their workloads effortlessly without building complex computing infrastructure.
On the other hand, Kubernetes, a popular open-source container orchestration platform, can help manage a wide range of workloads with ease through KubeRay, an operator for Ray workloads.
At ByteDance, thousands of jobs are submitted daily to Ray clusters created by KubeRay. With the ability to debug programs on long-running clusters and launch regular jobs through RayJob custom resources, users benefit from a streamlined workflow.
Meanwhile, efficiently managing concurrent Ray jobs poses challenges such as job starvation and resource allocation. Kueue, a Kubernetes-native job queueing system offering capabilities like resource management, multi-tenant support, and resource fair sharing, addresses these Ray job challenges in Kubernetes.
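To make the Kueue integration concrete, here is a minimal, hedged sketch that enqueues a KubeRay RayJob into a Kueue LocalQueue by labeling it with kueue.x-k8s.io/queue-name and creating it suspended so Kueue admits it when quota is available; the queue name, image, and entrypoint are illustrative assumptions.

# Minimal sketch: a RayJob targeted at a Kueue LocalQueue. Queue name, image,
# and entrypoint are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()

ray_job = {
    "apiVersion": "ray.io/v1",
    "kind": "RayJob",
    "metadata": {
        "name": "demo-ray-job",
        "namespace": "default",
        "labels": {"kueue.x-k8s.io/queue-name": "team-a-queue"},  # hypothetical queue
    },
    "spec": {
        "suspend": True,  # let Kueue unsuspend the job once it is admitted
        "entrypoint": "python train.py",  # placeholder entrypoint
        "rayClusterSpec": {
            "headGroupSpec": {
                "rayStartParams": {},
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "ray-head",
                            "image": "rayproject/ray:latest",  # placeholder image
                        }]
                    }
                },
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="ray.io", version="v1", namespace="default",
    plural="rayjobs", body=ray_job,
)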
KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 Sessionize Event
Maintainer Track + ContribFest: KubeCon + CloudNativeCon Europe 2024 Sessionize Event
KubeCon + CloudNativeCon + Open Source Summit China 2023 Sessionize Event