Running NVIDIA GB200 with DRA on Kubernetes in the Cloud: Experiences and Lessons

The NVIDIA GB200 system enables GPUs to communicate directly over high-bandwidth NVLink, transforming a rack of 18 GPU nodes into a single supercomputer with up to 72 GPUs. To support such cross-node resource sharing through GB200’s IMEX channels, the Kubernetes community has introduced new Dynamic Resource Allocation (DRA) features, including ComputeDomain, which provide the primitives to represent and allocate complex multi-node GPU resources on GB200.

However, because both GB200 and DRA are new, deploying and using them in Kubernetes presents unique challenges.

In this talk, we’ll share our experiences, including challenges and lessons learned, from operating NVIDIA GB200 with DRA in production Kubernetes clusters across AWS, GCP, and OCI, highlighting both common and cloud-specific issues. We’ll walk through an example of launching a multi-node MPI job via DRA, demonstrating how it leverages GB200’s IMEX channels and NVLink to run large-scale multi-node workloads.

Yuan Chen

NVIDIA, Software Engineer, Kubernetes, GPU, AI/ML Infrastructure

San Jose, California, United States
