Profiling with eBPF and Dynamic Process Injection: Optimize AI Workload Performance

As AI workloads scale rapidly, traditional GPU monitoring solutions provide only limited observability, which makes effective full-stack analysis and optimization of AI workload performance a significant challenge.

In this presentation, we will show how to combine eBPF with dynamic process injection to build a non-intrusive, low-overhead online profiling solution for AI workloads. We will demonstrate how this profiling mechanism captures data from AI frameworks, GPU kernel functions, GPU library calls, system calls, and CPU contexts, and presents it as a visual timeline. Using these timelines, we can perform fine-grained analysis of bottlenecks and potential issues such as slow training iterations, GPU data transfer delays, NCCL hangs, and CPU overload. Finally, we will share practical results from applying this profiling solution to AI workloads to improve training and inference performance.

Zhixin Huo

Alibaba Cloud Intelligence, Senior Software Engineer

Beijing, China
