From Lag to Lightning: Turbocharging Kubernetes Job Controllers for Massive Scale

Kubernetes controllers are core building blocks for managing jobs on any ML platform. If they are not designed for scale, they become major bottlenecks. This talk covers a case study of a job controller that slowed down significantly after 250+ job submissions, leading to dropped events in the controller and hanging job submission and update processes.
We approached the problem by simulating a large number of jobs arriving at once. As part of this, we profiled memory usage and lock contention, examined workqueue management, and analysed the event handling path in the Go client library. We found that unmanaged global locks, combined with a low number of workqueue workers, heavy lifting in the event handling path, and default QPS and burst settings, can slow down the overall system.
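
To make two of these findings concrete (light event handlers and more than a handful of workers), here is a minimal sketch using only the Go standard library. It is not the controller from the talk and it is not the client library's workqueue: `dedupQueue`, `reconcileAll`, and the job key format are hypothetical names chosen for illustration. The idea it shows is that an event handler should only enqueue a key, duplicate keys that are still waiting should collapse into one, and the expensive work should happen in a pool of workers.

```go
package main

import (
	"fmt"
	"sync"
)

// dedupQueue is a hypothetical, simplified stand-in for a controller workqueue.
// A key that is already waiting is not added a second time.
type dedupQueue struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending map[string]bool
	order   []string
	closed  bool
}

func newDedupQueue() *dedupQueue {
	q := &dedupQueue{pending: make(map[string]bool)}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Add is all an event handler should do: record the key and return quickly.
func (q *dedupQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.closed || q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
	q.cond.Signal()
}

// Get blocks until a key is available, or returns false once the queue
// has been shut down and fully drained.
func (q *dedupQueue) Get() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.order) == 0 && !q.closed {
		q.cond.Wait()
	}
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// ShutDown stops new additions and wakes every waiting worker.
func (q *dedupQueue) ShutDown() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.closed = true
	q.cond.Broadcast()
}

// reconcileAll enqueues every event key, then drains the queue with the
// given number of workers. It returns how often each key was reconciled.
func reconcileAll(events []string, workers int) map[string]int {
	q := newDedupQueue()
	for _, key := range events {
		q.Add(key) // cheap: no API calls or heavy work in the handler path
	}
	q.ShutDown()

	var mu sync.Mutex
	counts := make(map[string]int)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				key, ok := q.Get()
				if !ok {
					return
				}
				// A real controller would do its expensive reconcile here.
				mu.Lock()
				counts[key]++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counts
}

func main() {
	// Simulate a burst: 250 jobs, each producing three events.
	var events []string
	for i := 0; i < 250; i++ {
		key := fmt.Sprintf("default/job-%d", i)
		events = append(events, key, key, key)
	}
	counts := reconcileAll(events, 8)
	fmt.Printf("%d events collapsed into %d reconciles\n", len(events), len(counts))
}
```

The worker count of 8 is an arbitrary value for the sketch. In a real controller the right number depends on how long each reconcile takes and on the client's QPS and burst limits, which this example does not model.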

In this lightning talk, I present our investigation, our observations, the bottlenecks we identified, the tools we used, our proposed solutions, and benchmarking results.

Arpit Singh (SW-CLOUD) US

Senior Software Engineer, Nvidia

