Speaker

Arpit Singh (SW-CLOUD) US

Senior Software Engineer, Nvidia

Arpit Singh specializes in AI infrastructure at Nvidia, enhancing deep learning applications. A Kubernetes contributor, he has 10+ years of experience spanning Nvidia, Nutanix, and Cisco. He holds multiple patents (2 granted, 4+ pending) and two master's degrees.

Accelerate Training Performance with torchrun's Native NUMA Binding

The rapid growth in demand for accelerated hardware like GPUs is outpacing supply, limiting deep learning researchers’ access. At the same time, much of this valuable hardware remains underutilized due to insufficient optimization knowledge. To address this gap, we have enhanced PyTorch’s torchrun with a NUMA Binding (--numa_binding) option that automates CPU core assignments to ranks based on hardware topology. This feature boosts training performance and simplifies complex configuration processes.

The NUMA Binding feature includes four binding strategies: node, exclusive, core-complex, and socket. Each binding strategy is designed to enhance workload performance for specific hardware configurations by assessing system topology and adapting to diverse architectures. In our tests with the nnU-Net model, we observed a mean throughput improvement of 11–13%, depending on the scaling and binding strategy used.
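By way of illustration, assume a node is launched with something like `torchrun --nproc-per-node=8 --numa_binding=node train.py` (flag spelling taken from this abstract; the exact syntax may differ). A minimal sketch of what the node strategy amounts to, not torchrun's actual implementation, is to pin each local rank to the CPUs of one NUMA node:

```python
import os

def bind_rank_to_numa_node() -> None:
    """Minimal sketch of 'node' binding: pin this rank's process to the CPUs
    of one NUMA node, chosen round-robin by local rank. Linux-only,
    illustrative, and not torchrun's actual implementation."""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun
    nodes = sorted(
        int(d[4:]) for d in os.listdir("/sys/devices/system/node")
        if d.startswith("node") and d[4:].isdigit()
    )
    numa_node = nodes[local_rank % len(nodes)]
    with open(f"/sys/devices/system/node/node{numa_node}/cpulist") as f:
        cpulist = f.read().strip()  # e.g. "0-31,64-95"
    cpus = set()
    for span in cpulist.split(","):
        lo, _, hi = span.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    os.sched_setaffinity(0, cpus)  # 0 = current process

if __name__ == "__main__":
    bind_rank_to_numa_node()
```

The other strategies would differ only in the CPU set chosen (e.g., a single core complex or a whole socket instead of a full NUMA node).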

In this talk, we’ll showcase live examples and demonstrate the impact of NUMA Binding on distributed training workflows. Attendees will learn how to leverage this feature for significant training performance gains using their existing infrastructure.

krun - Multi-Node Launcher for Deep Learning Workloads in Kubernetes

krun (Kubernetes-run) is a multi-node application launcher for Kubernetes, designed to run large-scale deep learning jobs. It supports frameworks such as PyTorch and MPI and abstracts the complexities of the underlying infrastructure, eliminating the need for users to construct complex launch commands.

This talk will show how users can pass a training script alongside the launcher executable and have the distributed launch command and required environment set up for them. A live demo will highlight the launcher's ability to run large language models (LLMs). We will also demonstrate performance improvements through the integrated NUMA Binding, which optimizes resource allocation for faster training.

krun offers two key advantages: it simplifies the launch of deep learning workloads, and it boosts training performance. Together, these let researchers and data scientists focus on their core work, accelerating AI development in Kubernetes environments.
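For illustration only, since this abstract does not show krun's real CLI, a launch might look something like the following, with every flag name a placeholder:

```bash
# Hypothetical krun invocation; all flags are placeholders, not krun's real CLI.
krun --nodes=4 --gpus-per-node=8 python train.py
```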

krun - Multi-Node Launcher for Deep Learning Workloads

krun (Kubernetes-run) is a multi-node application launcher for Kubernetes-based platforms that supports PyTorch jobs. It is designed to launch distributed deep learning workloads across one or more nodes while easing the launch process for the user.

The primary benefits of krun over existing launchers (mpirun and srun) are:
- Removes the dependency on mpirun in the container image.
- Provides srun equivalence, letting users easily migrate jobs between Slurm- and Kubernetes-based clusters.
- Integrates tightly with the PyTorch framework, with the ability to extend its capabilities for a given platform.

krun enables PyTorch-based deep learning workloads such as LLMs to achieve peak performance by performing NUMA binding for GPUs. The NUMA Binding feature eliminates the need for users to know the hardware topology or hand-tune the hardware: it gives them a mechanism to bind their job's rank processes to CPU cores. For the BERT workload, NUMA binding provides performance gains of about 0.6%; for SSD, the improvement is around 2.5%.
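As a rough sketch of the mechanism (not krun's actual code), the snippet below looks up the NUMA node behind a GPU through its PCI address in sysfs; the `pynvml` calls are standard, while the overall flow is an assumption based on the description above.

```python
import os
import pynvml  # pip install nvidia-ml-py

def gpu_numa_node(gpu_index: int) -> int:
    """Return the NUMA node behind a GPU, via its PCI address in sysfs.
    Illustrative sketch; not krun's actual implementation."""
    pynvml.nvmlInit()
    try:
        bus_id = pynvml.nvmlDeviceGetPciInfo(
            pynvml.nvmlDeviceGetHandleByIndex(gpu_index)).busId
        if isinstance(bus_id, bytes):
            bus_id = bus_id.decode()
    finally:
        pynvml.nvmlShutdown()
    # sysfs expects the short, lowercase PCI address, e.g. '0000:17:00.0'
    with open(f"/sys/bus/pci/devices/{bus_id.lower()[-12:]}/numa_node") as f:
        return int(f.read())  # -1 if the platform does not report topology

# Each rank can then bind itself to that node's CPUs with
# os.sched_setaffinity, as in the earlier torchrun sketch.
print(gpu_numa_node(int(os.environ.get("LOCAL_RANK", "0"))))
```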

Enabling Fault Tolerance for GPU Accelerated AI workloads in Kubernetes

In Kubernetes-based ML platforms, job failures from hardware errors such as GPU malfunctions, network disruptions, ECC errors, and OOM events pose significant challenges. These failures cause resource underutilization, wasted engineering time, and high operational costs, and they often require users to resubmit jobs.

Current AI/ML frameworks lack adequate fault-tolerance strategies, typically requiring manual intervention and causing delays before jobs can resume. This talk explores fault-tolerance strategies including naive job restarts on failure, job restarts with hot spares, and job restarts that replace faulty nodes. We discuss how to achieve fault propagation by leveraging node and pod conditions, and we address gaps in fault discovery and error propagation in the existing Kubernetes ecosystem. The talk will also cover ways to enhance components like the node-problem-detector and to introduce new elements that close the gaps in fault detection, propagation, reaction, and remediation.
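As a minimal sketch of the condition-based detection idea, the snippet below watches for fault conditions with the Kubernetes Python client. It assumes the node-problem-detector (or a custom plugin) surfaces GPU faults as node conditions; the condition names here are placeholders, not standard types.

```python
from kubernetes import client, config

# Placeholder condition types; assumes a node-problem-detector plugin sets them.
FAULT_CONDITIONS = {"XidCriticalError", "GpuEccUncorrectable"}

config.load_kube_config()  # use config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type in FAULT_CONDITIONS and cond.status == "True":
            # A remediation controller could cordon the node here and
            # restart the affected job on a hot spare.
            print(f"{node.metadata.name}: {cond.type}: {cond.message}")
```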

From Lag to Lightning: Turbocharging Kubernetes Job Controllers for Massive Scale

Kubernetes controllers form the core building blocks of any ML platform's job management. If not designed for scale, they become major bottlenecks. This talk covers a case study of a job controller that slowed down significantly after 250+ job submissions, dropping events and hanging job submission and update operations.
We approached the problem by simulating a large number of jobs arriving at once. As part of this, we performed memory profiling, lock-contention profiling, and workqueue analysis, and we traced the event-handling path in the Go client library. We found that unmanaged global locks, combined with too few workqueue workers, heavy lifting in the event-handling path, and default QPS/burst settings, can slow down the overall system.
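To make the worker-count effect concrete, here is a toy model of a controller workqueue with all numbers purely illustrative; the real controller is written in Go, where the analogous knobs are the number of workqueue workers and the QPS/Burst fields on the client-go config.

```python
import queue
import threading
import time

def drain(workers: int, events: int, handle_s: float = 0.02) -> float:
    """Time to drain a backlog of `events` items with `workers` consumers,
    each spending handle_s per item (a stand-in for a heavy event handler)."""
    q: "queue.Queue[int]" = queue.Queue()
    for i in range(events):
        q.put(i)

    def worker() -> None:
        while True:
            try:
                q.get_nowait()
            except queue.Empty:
                return
            time.sleep(handle_s)  # simulated work; releases the GIL like real I/O

    start = time.monotonic()
    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

# A burst of 250 job events: one worker falls far behind, eight keep up.
print(f"1 worker : {drain(1, 250):.2f}s")
print(f"8 workers: {drain(8, 250):.2f}s")
```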

In this lightning talk, I present our investigation: the observations and bottlenecks we found, the tools we used, our proposed solutions, and benchmarking results.
