Online Profiling and Analysing GPU and AI jobs with eBPF and Kubernetes

Most AI workloads now run in Kubernetes. Although mature GPU scheduling and monitoring solutions exist, it is still a challenge to analyze and optimize the performance of AI workloads efficiently. At Alibaba Cloud, we use eBPF to trace running AI pods and construct a unified timeline across multiple dimensions: CPU calls, OS system calls, GPU driver library calls, CUDA calls, and PyTorch and Python method invocations. This gives us comprehensive, real-time observation of AI tasks. We use a CUDA interception technique to integrate GPU kernel execution details into the overall timeline. A profiling pod can be attached to an AI job on demand, without intruding on it. The profiler can then perform fine-grained analysis of bottlenecks and potential issues such as slow training iterations, GPU data transfer delays, NCCL hangs, or CPU overload. The online profiling solution ultimately identifies optimization directions for GPU and AI tasks in a Kubernetes cluster.
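To make the "unified timeline" idea concrete, here is a minimal sketch in Python. It is not the implementation described in the talk: the `Event` record, the `merge_timeline` and `longest_gap` helpers, and the sample events are all hypothetical, and it assumes every source already reports timestamps on one shared clock. It only illustrates how per-source event streams could be merged by timestamp and then inspected for idle gaps in one dimension.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical trace record; field names are illustrative assumptions."""
    ts_ns: int    # start timestamp in nanoseconds (assumed shared clock)
    source: str   # dimension the event came from, e.g. "syscall", "cuda"
    name: str     # traced call name
    dur_ns: int   # duration in nanoseconds


def merge_timeline(*streams):
    """Merge per-source streams, each already sorted by ts_ns, into one list."""
    return list(heapq.merge(*streams, key=lambda e: e.ts_ns))


def longest_gap(timeline, source):
    """Largest idle gap (ns) between consecutive events of a single source."""
    events = [e for e in timeline if e.source == source]
    gaps = [b.ts_ns - (a.ts_ns + a.dur_ns) for a, b in zip(events, events[1:])]
    return max(gaps, default=0)


# Made-up sample events, one list per tracing dimension.
syscalls = [Event(100, "syscall", "read", 50), Event(900, "syscall", "futex", 20)]
cuda = [Event(200, "cuda", "cudaMemcpyAsync", 300),
        Event(1000, "cuda", "cudaLaunchKernel", 10)]
python = [Event(50, "python", "DataLoader.__next__", 120)]

timeline = merge_timeline(syscalls, cuda, python)
for event in timeline:
    print(event.ts_ns, event.source, event.name)
print("longest CUDA idle gap (ns):", longest_gap(timeline, "cuda"))
```

In this sketch, a large idle gap on the CUDA dimension that overlaps with syscall or Python activity would hint that the GPU is waiting on the CPU side, which is the kind of cross-dimension reading a unified timeline is meant to support.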

Zhixin Huo

Alibaba Cloud Intelligence, Senior Software Engineer

Beijing, China
