Session
Profiling with eBPF and Dynamic Process Injection: Optimize AI Workload Performance
As AI workloads scale rapidly, traditional GPU monitoring solutions provide only limited observability, which makes full-stack performance analysis and optimization of these workloads a significant challenge.
In this presentation, we will show how to use eBPF and dynamic process injection to build a non-intrusive, low-overhead online profiling solution for AI workloads. We will demonstrate how this profiling mechanism captures data from AI frameworks, GPU kernel functions, GPU library calls, system calls, and CPU contexts, and presents it as a visual timeline. From this timeline, we can perform fine-grained analysis of bottlenecks and potential issues such as slow training iterations, GPU data transfer delays, NCCL hangs, and CPU overload. We will also share practical results from applying this profiling solution to AI workloads to improve training and inference performance.
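To make the timeline idea concrete, here is a minimal sketch of how events collected from different layers might be merged into a single timeline. It is an illustration only, not the speaker's implementation: the Event class, the layer labels, the sample values, and the helper names are all hypothetical. The output uses the Chrome Trace Event JSON layout ("X" complete events with microsecond timestamps), chosen here simply because common trace viewers can load it.

```python
import json
from dataclasses import dataclass


@dataclass
class Event:
    """One captured span. All fields are hypothetical, for illustration."""
    source: str    # layer that produced the event, e.g. "framework", "gpu_kernel"
    name: str      # function or operation name
    start_us: int  # start timestamp in microseconds
    dur_us: int    # duration in microseconds


def to_trace(events):
    """Merge events from all layers into one Chrome Trace Event document."""
    lanes = {}  # one timeline lane (tid) per source layer
    trace = []
    for ev in sorted(events, key=lambda e: e.start_us):
        tid = lanes.setdefault(ev.source, len(lanes) + 1)
        trace.append({
            "name": ev.name,
            "cat": ev.source,
            "ph": "X",          # "complete" event: has both ts and dur
            "ts": ev.start_us,
            "dur": ev.dur_us,
            "pid": 1,
            "tid": tid,
        })
    return {"traceEvents": trace}


def slowest(events, source):
    """Return the longest event from one layer, or None if the layer is empty."""
    candidates = [e for e in events if e.source == source]
    return max(candidates, key=lambda e: e.dur_us) if candidates else None


# Made-up sample data standing in for what a profiler might capture.
SAMPLE = [
    Event("framework", "iteration_1", 0, 900),
    Event("gpu_kernel", "matmul", 100, 300),
    Event("syscall", "read", 50, 20),
    Event("framework", "iteration_2", 1000, 2400),
    Event("gpu_library", "memcpy_h2d", 1100, 1500),
]

if __name__ == "__main__":
    print(json.dumps(to_trace(SAMPLE), indent=2))
    print("slowest iteration:", slowest(SAMPLE, "framework").name)
```

In this sample, the second iteration stands out as the slowest framework span, and it overlaps a long host-to-device copy in the same window, which is the kind of correlation a merged timeline is meant to surface.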
Zhixin Huo
Alibaba Cloud Intelligence, Senior Software Engineer
Beijing, China