Accelerate Training Performance with torchrun's Native NUMA Binding

The rapid growth in demand for accelerated hardware like GPUs is outpacing supply, limiting deep learning researchers' access to compute. At the same time, much of this valuable hardware remains underutilized because optimization knowledge is scarce. To address this gap, we have enhanced PyTorch's torchrun with a NUMA Binding (--numa_binding) option that automatically assigns CPU cores to ranks based on hardware topology. This feature boosts training performance and simplifies an otherwise complex configuration process.
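
As an illustration, a launch with node-level binding might look like the following. This is a sketch only; the exact flag spelling and accepted values may vary across PyTorch versions, so consult the torchrun documentation for your release:

    torchrun --nproc_per_node=8 --numa_binding=node train.py

Here torchrun spawns eight ranks and, under the node strategy, each rank's CPU affinity would be restricted to a single NUMA node, typically the one closest to its GPU.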

The NUMA Binding feature offers four binding strategies: node, exclusive, core-complex, and socket. Each strategy assesses the system topology and adapts to diverse architectures, tailoring core assignments to a specific class of hardware configuration. In our tests with the nnU-Net model, we observed a mean throughput improvement of 11–13%, depending on the scaling configuration and binding strategy used.
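
To make the node strategy concrete, below is a minimal, Linux-only sketch of the underlying mechanism: restricting a process's CPU affinity to the cores of one NUMA node. It uses only Python standard-library calls; the helper names are hypothetical, and this is an approximation of the idea, not torchrun's actual implementation.

    import os

    def numa_node_cpus(numa_node: int) -> set[int]:
        # Parse a sysfs cpulist such as "0-31,64-95" into a set of CPU ids.
        path = f"/sys/devices/system/node/node{numa_node}/cpulist"
        with open(path) as f:
            spec = f.read().strip()
        cpus: set[int] = set()
        for part in spec.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                cpus.update(range(int(lo), int(hi) + 1))
            else:
                cpus.add(int(part))
        return cpus

    def bind_to_numa_node(numa_node: int) -> None:
        # Pin the calling process (pid 0 = self) to that node's CPUs.
        os.sched_setaffinity(0, numa_node_cpus(numa_node))

    if __name__ == "__main__":
        # Illustration only: bind this process to NUMA node 0. A launcher
        # would instead pick the node local to each rank's GPU.
        bind_to_numa_node(0)
        print("affinity:", sorted(os.sched_getaffinity(0)))

Doing this by hand requires deriving the GPU-to-NUMA mapping for every rank on every machine; automating that derivation from the hardware topology is precisely the convenience the torchrun option provides.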

In this talk, we’ll showcase live examples and demonstrate the impact of NUMA Binding on distributed training workflows. Attendees will learn how to leverage this feature for significant training performance gains using their existing infrastructure.

Arpit Singh

Senior Software Engineer, NVIDIA
