Chen Zicong
CNCF Volcano Maintainer, LeaderWorkerSet Contributor & R&D Engineer at Huawei Cloud
Hangzhou, China
Zicong Chen is a maintainer of the CNCF project Volcano and an R&D Engineer at Huawei Cloud specializing in cloud-native scheduling. He has deep expertise in this domain and is dedicated to solving the scheduling challenges posed by complex workloads such as AI/HPC. As an active community leader, he regularly hosts the Volcano community meetings and has spoken at technical conferences including KCD Beijing, GOSIM Hangzhou, and KubeCon EU 2026. His understanding extends across the AI ecosystem: contributions to projects like LeaderWorkerSet have given him deep insight into managing the unique demands of inference workloads.
Your Cluster Isn't Flat: A First-Class API for Real-World Infrastructure Topology
Kubernetes's flat-node model forces us to "simulate" hierarchy with labels, not truly "model" it. This creates severe challenges in heterogeneous clusters: not just operational complexity, but critically, native tools like NodeAffinity and TopologySpreadConstraints cannot treat a rack or node pool as a single unit for capacity assessment or bin-packing.
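To make the limitation concrete, here is a minimal sketch of the label-based workaround the abstract describes. The label keys (`topology.example.com/rack`) are hypothetical placeholders, not a real convention; the point is that a spread constraint can only distribute pods across label values and cannot reason about a rack's aggregate capacity.

```yaml
# Hierarchy "simulated" with node labels, e.g. topology.example.com/rack: rack-1
# (label keys here are illustrative, not a standard).
apiVersion: v1
kind: Pod
metadata:
  name: worker
  labels:
    app: trainer
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.example.com/rack   # spreads pods across racks,
      whenUnsatisfiable: DoNotSchedule         # but cannot bin-pack a rack
      labelSelector:                           # or assess its free capacity
        matchLabels:
          app: trainer
  containers:
    - name: trainer
      image: trainer:latest
```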
This session introduces an architecture that elevates infrastructure hierarchy to a "first-class citizen" via the HyperNode API. Its core is a pluggable, automated discovery engine that onboards clusters via existing labels and integrates directly with hardware controllers like NVIDIA UFM.
Using Volcano as a case study, we will show how this mechanism solves complex scheduling challenges that span both network topology and hardware constraints. We believe this is not just an evolution for Volcano, but a proposal to establish physical-world modeling as a universal core abstraction for the CNCF community.
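As a rough illustration of what "first-class" modeling looks like, the sketch below shows a HyperNode object in the shape used by Volcano's network-topology-aware scheduling: a tier-1 HyperNode grouping the nodes of one switch or rack. Exact field names and the API version may differ across Volcano releases, so treat this as an assumption-laden sketch rather than a definitive manifest.

```yaml
# A leaf-level (tier 1) HyperNode grouping physical nodes under one switch.
# Higher tiers can then reference HyperNodes as members, forming the hierarchy.
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: rack-1
spec:
  tier: 1            # lowest tier: members are Nodes, not other HyperNodes
  members:
    - type: Node
      selector:
        exactMatch:
          name: node-0
    - type: Node
      selector:
        exactMatch:
          name: node-1
```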
Volcano: Orchestrating the Full AI Lifecycle – From Training to Inference and Agents
The rapid evolution of AI has led to infrastructure fragmentation, where training, inference, and agent workloads run in isolated systems, causing resource inefficiency. Volcano addresses this as a Unified Scheduling Platform for the full AI lifecycle, delivering robust scheduling capabilities with high throughput.
Volcano is evolving into the next-generation platform capable of orchestrating diverse workloads beyond batch jobs, enabling multi-scheduler coordination.
At the workload layer:
- Volcano-Global splits massive training jobs across clusters, removing single-cluster limits
- Kthena delivers enterprise-grade LLM serving with frameworks like vLLM
- AgentCube enables rapid agent workload scheduling
At the infra layer, Volcano provides modern resource abstraction through DRA integration, HyperNode discovery, GPU sharing, and heterogeneous pooling for efficient task-to-accelerator mapping.
Join us to explore how Volcano is shaping the future of Cloud Native AI infra.
Elevating AI Orchestration: Intelligent Scheduling on Ascend Hardware
Hardware technology is evolving rapidly, and not every hardware platform on the market offers GPU-compatible interfaces, which makes the efficient use of heterogeneous hardware a major challenge. This session focuses on our hands-on experience implementing intelligent scheduling of AI workloads on Ascend hardware.
The Ascend NPU is built on Da Vinci cores optimized specifically for neural-network computation, which differs fundamentally from the GPU's CUDA architecture. Given these architectural characteristics, frameworks such as vLLM and Ray need native support so that hardware nodes can be managed uniformly as schedulable resources. Building on this foundation, we will detail how to implement advanced scheduling mechanisms such as Gang Scheduling and Dominant Resource Fairness to better orchestrate resources for complex AI workloads.
Our approach effectively addresses the key challenges AI services face in heterogeneous hardware environments. Through real-world cases, we will show how these optimizations significantly improve the management efficiency of AI workloads in production.
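The gang-scheduling mechanism mentioned above can be sketched as a Volcano Job: `minAvailable` tells the scheduler to start all replicas together or none at all, avoiding the deadlocks that partial placement causes for distributed training. The NPU resource name shown (`huawei.com/Ascend910`) depends on the device plugin deployed and is an assumption here; the image name is a placeholder.

```yaml
# Gang-scheduled training job: all 4 workers are placed atomically.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: npu-training
spec:
  schedulerName: volcano
  minAvailable: 4            # gang scheduling: schedule 4 pods or none
  queue: default
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              image: trainer:latest   # placeholder image
              resources:
                limits:
                  huawei.com/Ascend910: 1   # NPU resource name varies by device plugin
```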