Session

When GPU Nodes Misbehave: Common Issues and Fixes in Kubernetes

GPU node failures and errors are among the most common and challenging issues in AI clusters that use GPU accelerators, and they make operating and managing Kubernetes GPU clusters significantly harder.

In this talk, we’ll share lessons learned from troubleshooting common GPU node issues while operating and managing hundreds of Kubernetes GPU clusters and tens of thousands of GPUs for AI and ML workloads on NVIDIA DGX Cloud.

We’ll cover the most frequent GPU node issues in production, including missing or disappearing GPU devices, common XID errors, and thermal or performance throttling. For each issue, we’ll walk through its symptoms, root causes, mitigations, and resolutions. We’ll also highlight NVIDIA’s ongoing initiatives and open source efforts to automate the detection and mitigation of these issues, improving the reliability and operational efficiency of Kubernetes GPU clusters.

Yuan Chen

NVIDIA, Software Engineer, Kubernetes, GPU, AI/ML Infrastructure

San Jose, California, United States
