Weizhou Lan

daocloud network cloud-native

Shanghai, China

I currently serve as a Senior Tech Lead at DaoCloud, with over 15 years of engineering experience, including 8 years of Kubernetes expertise. My technical focus spans networking, eBPF, service mesh, AI, and more. I have delivered 5 KubeCon talks and 3 KCD talks, and have served on the KubeCon program committee 4 times. I have incubated 6 cloud-native open-source projects, including one CNCF Sandbox project. Over the past three years, I ranked 28th globally in CNCF developer activity contributions.

eBPF Strengthens SR-IOV To Be Powerful

The growing popularity of AI workloads on Kubernetes has driven demand for high-performance networking solutions. Virtual networking interfaces such as SR-IOV, macvlan, and ipvlan stand out because they bring RDMA capability to Pods, but their data path prevents them from resolving clusterIP. This limitation has forced many users to adopt dual CNIs.
Mature eBPF implementations are already employed in popular CNI projects like Cilium and Calico, where they successfully replace kube-proxy. By leveraging eBPF within a single CNI, we can enable clusterIP resolution for SR-IOV and macvlan interfaces.
This session will delve into the technical details of implementing eBPF for clusterIP resolution with SR-IOV interfaces, and show its high forwarding performance compared with projects such as Calico and Cilium. While combining eBPF with SR-IOV is not yet a mainstream practice, its potential for performance optimization and simplified network management is significant.

Per-node Api-server Proxy: Expand The Cluster's Scale and Stability

In many CNCF projects, various DaemonSets on every node synchronize data from the API server at the same time. In large-scale clusters especially, this puts significant pressure on the API server, burdens the network, and can even affect the stability of the cluster.
Some projects have implemented optimizations to address this. For instance, Cilium aggregates endpoint information into the CiliumEndpointSlice CRD before distributing it to its DaemonSet. However, many projects have not yet adopted such data aggregation, and there is still no project that helps improve communication between all components and the API server.
ClusterPedia supports launching per-node API server proxies that serve all local Pods, and uses eBPF to resolve the API server's clusterIP to the local proxy, which transparently redirects API server access on demand. In large-scale clusters, this can significantly improve the stability of all services in the cluster.

an innovative network solution of underlay CNI

The underlay CNI solution can meet the needs of low-latency applications, traditional host applications migrating to the cloud, and storage components that require an independent network to transmit data.
Currently, the open source community offers few options. CNI projects such as macvlan, SR-IOV, and ipvlan lack functions like IPAM and clusterIP communication. Projects like Multus can attach multiple network interfaces to a Pod, but problems such as route tuning and IP allocation across those interfaces remain unsolved. No IPAM project in the community supports the complicated requirements of underlay IPAM, such as IP dual-stack support, fixed IPs for applications, IP recycling, and IP conflicts.
Spiderpool provides an IPAM solution and can integrate with macvlan, SR-IOV, ipvlan, and Multus to solve the pain points above and integrate well with various infrastructure resources. The result is a new underlay network solution.

Cloud Native Networking For AI : Strengthen CNI For RDMA

With the growing popularity of running AI workloads on Kubernetes, RDMA technology, with its high network performance and CPU offload, helps shorten training time. Only SR-IOV, macvlan, and ipvlan can effectively deliver RDMA capabilities, but in practice with Nvidia's network-operator there are many problems: for example, Pods cannot access clusterIP, stable IPAM is lacking, and routing across multiple network cards cannot be managed effectively. To accommodate the network requirements of both AI and traditional applications at the same time, many deployments have to run dual CNIs. This presentation provides a solution that lets macvlan and SR-IOV satisfy all of these needs with a single CNI.

[Sponsor] Exploring Large-Scale AI Training on RDMA Container Networks

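With ChatGPT igniting enthusiasm and expectations for AI, vendors and developers are paying increasing attention to the field. A single training run of a large model can take dozens of days, so handling compute scheduling and network communication well during AI training can raise the training success rate and shorten training time. This talk first introduces the network problems that may arise in large-scale training. Next, it shows how Spiderpool, a Kubernetes CNI plugin that provides RDMA capability, can meet the underlay and RDMA network requirements of large-scale multi-node, multi-GPU distributed training, with good support for RoCE and InfiniBand scenarios, greatly improving the communication efficiency of distributed training. Finally, we explore the feasibility of training large models across multiple clusters and introduce our multi-cluster training and multi-cluster networking solutions.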

KubeCon + CloudNativeCon North America 2024

November 2024 Salt Lake City, Utah, United States

KCD Shanghai 2024

April 2024 Shanghai, China

CNCF-hosted Co-located Events Europe 2024

March 2024 Paris, France

KubeCon + CloudNativeCon North America 2023

November 2023 Chicago, Illinois, United States
