krun - Multi-Node Launcher for Deep Learning Workloads

krun (kubernetes-run) is a multi-node application launcher for Kubernetes-based platforms that supports PyTorch jobs. It is designed to launch distributed DL workloads on a single node or across multiple nodes and to simplify the launch process for the user.

The primary benefits of krun over other existing launchers (mpirun & srun) are the following:
- Removes dependency on mpirun in the container image.
- Provides srun equivalence so that users can easily migrate jobs between Slurm-based and Kubernetes-based clusters.
- Integrates tightly with the PyTorch framework and can extend its capabilities for a given platform.

krun enables PyTorch-based deep learning workloads such as LLMs to achieve peak performance by performing NUMA binding for GPU jobs. The NUMA binding feature removes the need for users to know the hardware topology or to tune the hardware by hand: it gives them a mechanism to bind their job's rank processes to CPU cores. For the BERT workload, NUMA binding provides a performance gain of about 0.6%, and for SSD the improvement is around 2.5%.
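
To make the idea of binding rank processes to CPU cores concrete, here is a minimal sketch in Python. It is not krun's implementation, which the abstract does not describe. The function names, the `numa_nodes` input format, and the even-split policy are all assumptions made for illustration; a real launcher would also need to discover the topology and account for which NUMA node each GPU is attached to.

```python
import os


def cores_for_rank(local_rank, ranks_per_node, numa_nodes):
    """Return the set of CPU cores one rank should be pinned to.

    numa_nodes is a hypothetical input: a list with one entry per NUMA
    node, each entry being the list of core ids on that node.
    Assumed policy: ranks are spread evenly over NUMA nodes, and the
    cores of a node are split evenly among the ranks that share it.
    """
    if not 0 <= local_rank < ranks_per_node:
        raise ValueError("local_rank out of range")
    if ranks_per_node % len(numa_nodes) != 0:
        raise ValueError("ranks must divide evenly across NUMA nodes")

    ranks_per_numa = ranks_per_node // len(numa_nodes)
    node = local_rank // ranks_per_numa   # which NUMA node this rank uses
    slot = local_rank % ranks_per_numa    # position of the rank within that node
    cores = numa_nodes[node]
    share = len(cores) // ranks_per_numa  # cores available to each rank
    if share == 0:
        raise ValueError("not enough cores per rank")
    return set(cores[slot * share:(slot + 1) * share])


def bind_current_process(cores):
    """Pin the calling process to the given cores (Linux only)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cores)  # pid 0 means the calling process
```

With two NUMA nodes of four cores each and four ranks per node, `cores_for_rank(2, 4, [[0, 1, 2, 3], [4, 5, 6, 7]])` selects cores 4 and 5, so rank 2 stays on the second NUMA node. Each rank process would then call `bind_current_process` with its own core set before starting work.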

Arpit Singh (SW-CLOUD) US

Senior Software Engineer, NVIDIA
