Session
Navigating AI/ML Workloads in Large-Scale Kubernetes Clusters
Managing GPU-based AI/ML workloads on Kubernetes is challenging: job management and scheduling are complex, and the workloads need substantial specialized computing resources, such as GPUs, which are not readily available.
This talk introduces Knavigator, an open-source framework and toolkit designed to support developers of Kubernetes systems. Knavigator facilitates the development, testing, troubleshooting, benchmarking, chaos engineering, performance analysis, and optimization of AI/ML control planes with GPUs in Kubernetes.
Knavigator enables tests on Kubernetes clusters using both real and virtual GPU nodes, so large-scale testing is possible even with limited resources, such as a laptop.
Through real examples and demos, this presentation will showcase Knavigator's capabilities in feature validation, performance and load testing, and reliability testing. It will also highlight how Knavigator enhances the fault tolerance of large model training jobs in Kubernetes.
Yuan Chen
Nvidia, Software Engineer, Kubernetes, Scheduling, GPU, AI/ML, Resource Management
San Jose, California, United States