Mengxuan Li

4paradigm

Actions

Member of volcano community

responsible for the development of gpu virtualization mechanism on volcano. It have been merged in the master branch of volcano, and will be released in v1.8.

speaker, in OpenAtom Global Open Source Commit#2023

K8s issue #52757: Sharing GPUs among multiple containers

This issue has plagued Kubernetes for nearly 8 years: K8s issue #52757. The challenge of flexibly sharing GPUs across multiple containers is particularly prominent in AI scenarios, where inference tasks are typically short-lived. As a result, resource utilization becomes a critical concern.

In this talk, we will share solutions and practices for implementing GPU sharing in Kubernetes, focusing on two key projects gaining traction recently: Dynamic Resource Allocation (DRA) and the CNCF sandbox project HAMi. The presentation will cover the following topics:
1. Challenges in GPU sharing.
2. Approaches for sharing AI chips beyond NVIDIA GPUs.
3. How sharing technologies integrate with projects like Volcano, Koordinator, and Kueue.

Unlocking how to efficiently, flexibly, manage and schedule seven AI chips in Kubernetes

There are more and more AI accelerator manufacturers emerged in recent years. Data centers often face scenarios where multiple AI accelerators from different vendors exist at the same time, such as Nvidia and AMD, Intel, etc..
Therefore, managing these heterogeneous devices face bigger challenges. The CNCF sandbox project HAMi (Heterogeneous AI Computing Virtualization Middleware) was officially born for this purpose.
This session will focus on efficiently managing heterogeneous AI chips through HAMi in Kubernetes clusters
* A unified scheduler which capable of topology-aware, numa-aware, supports binpack and spread schedule policy on 7 AI accelerators.
* Virtualization on 6 AI accelerators
* Task priority
* Memory oversubscription on k8s GPU tasks
* Observability in two dimensions: allocated resources and real usage
* HAMi+Volcano/Koordinator for collaborative orchestration and scheduling capabilities on AI batch tasks
* HAMi+Kueue for practice in training and inference scenarios

Unlocking Heterogeneous AI Infrastructure K8s Cluster: Leveraging the Power of HAMi

With AI's growing popularity, Kubernetes has become the de facto AI infrastructure. However, the increasing number of clusters with diverse AI devices (e.g., NVIDIA, Intel, Huawei Ascend) presents a major challenge.
AI devices are expensive, how to better improve resource utilization? How to better integrate with K8s clusters? How to manage heterogeneous AI devices consistently, support flexible scheduling policies, and observability all bring many challenges
The HAMi project was born for this purpose. This session including:
* How K8s manages heterogeneous AI devices (unified scheduling, observability)
* How to improve device usage by GPU share
* How to ensure the QOS of high-priority tasks in GPU share stories
* Support flexible scheduling strategies for GPU (NUMA affinity/anti-affinity, binpack/spread etc)
* Integration with other projects (such as volcano, scheduler-plugin, etc.)
* Real-world case studies from production-level users.
* Some other challenges still faced and roadmap

Cloud Native Batch Computing With Volcano: Updates and Future

Volcano is a cloud native batch platform and CNCF's first container batch computing project. It is optimized for AI and Bigdata by providing
the following capabilities:
- Full lifecycle management for jobs
- Scheduling policies for batch workloads
- Support for heterogeneous hardware
- Performance optimization for high performance workloads

This community has integrated with computing ecosystem like spark, flink, kubeflow, ray in big data and AI domains. and The project has been deployed by 50+ users in their production environment.

This year Volcano contributors have made great progress to help users to address challenges for LLM training and inferences. A number of new features are on the way to accelerate the GPU/Ascend NPU training efficiency, optimize resource utilization for large scale clusters and provides fine-grained scheduling.

This talk will presents the latest progress, new features, use cases, new sub-projects and the future of the community.

Mengxuan Li

4paradigm

Actions

Please note that Sessionize is not responsible for the accuracy or validity of the data provided by speakers. If you suspect this profile to be fake or spam, please let us know.

Speaker

Mengxuan Li

Actions

Links

Sessions

K8s issue #52757: Sharing GPUs among multiple containers

Unlocking how to efficiently, flexibly, manage and schedule seven AI chips in Kubernetes

Unlocking Heterogeneous AI Infrastructure K8s Cluster: Leveraging the Power of HAMi

Cloud Native Batch Computing With Volcano: Updates and Future

Mengxuan Li

Links

Actions