Efficient online profiling of AI workloads on Kubernetes based on eBPF and dynamic process injection

Most AI workloads now run on Kubernetes, yet analyzing and optimizing the performance of large-scale AI workloads remains a significant challenge. We use eBPF and dynamic process injection to monitor the Pods that run AI workloads. This approach enables online AI profiling that is transparent, non-intrusive, on-demand, and low-overhead for business Pods. We capture data from several layers, including AI frameworks, GPU kernel functions, GPU library calls, system calls, and CPU utilization, then aggregate it and render it as a timeline. This allows fine-grained analysis of bottlenecks and potential issues such as slow training, GPU data transfer delays, or NCCL hangs. We also support profiling across various types of GPU resources for versatility. The resulting profiles help identify optimization directions for GPUs and AI tasks in the Kubernetes cluster.

Zhixin Huo

Alibaba Cloud Intelligence, Senior Software Engineer

Beijing, China

