Session
Don't Let Your GPUs Get Lonely: Challenges with Distributed Multi-Node Training at Scale
As AI/ML workloads become more prevalent and models grow larger, proper management of multi-node training jobs becomes increasingly important. It is becoming essential to squeeze the most out of GPUs, and the networks that underlie them, to maximize performance.
In this unique panel combining experts from cloud providers, hardware vendors, and OSS Kubernetes maintainers, we will start from the basics of distributed deep learning and then dive into how these jobs can be orchestrated in Kubernetes using NCCL. NCCL (NVIDIA Collective Communication Library) is a core technology for multi-node, multi-GPU distributed training and for network acceleration of large AI models.
Starting from use cases, we will take a look under the hood at why distributed training is becoming a common pattern; how communication libraries like NCCL are integrated into ML frameworks; and how Kubernetes can help orchestrate multi-node training systems, as well as what the future holds in this area.
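As background for that integration point, the following is a minimal sketch of how a per-pod training process might select NCCL as its communication backend, using PyTorch as one example framework (PyTorch is not named in the session description). It assumes a Kubernetes launcher such as the Kubeflow training operator injects the standard RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK environment variables; the model and training loop are placeholders.

import os
import torch
import torch.distributed as dist

def init_distributed():
    # Kubernetes-based launchers (e.g. the Kubeflow training operator) typically
    # inject RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT into each worker pod.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # NCCL handles the GPU-to-GPU collectives (all-reduce, all-gather, ...)
    # within and across nodes.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Pin this process to one local GPU.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    return local_rank

if __name__ == "__main__":
    local_rank = init_distributed()
    # Placeholder model; gradients are synchronized via NCCL all-reduce.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    # ... training loop would go here ...
    dist.destroy_process_group()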