Session

Enhancing Reliability and Fault-Tolerance Testing in Kubernetes Using KWOK

Kubernetes has emerged as a popular platform for running AI workloads with GPUs. As a result, enhancing reliability has become increasingly important. This talk will demonstrate how KWOK, a widely used Kubernetes testing toolkit, has been enhanced for reliability and fault-tolerance testing.

Shiming Zhang, the creator and maintainer of KWOK, and Yuan Chen from NVIDIA will outline KWOK's ability to simulate and manage a large number of virtual nodes and pods on a laptop, and will discuss practical use cases at DaoCloud and NVIDIA.

The session will include examples and demos and take a deep dive into KWOK's latest chaos engineering features, in particular its ability to simulate failures by injecting targeted faults into GPU nodes and pods. These features facilitate reliability testing and the evaluation of fault-tolerance mechanisms, helping to improve the resilience of AI workloads in Kubernetes.

Attendees will gain practical experience and knowledge about KWOK and its advanced capabilities.

Yuan Chen

NVIDIA, Software Engineer, Kubernetes, Scheduling, GPU, AI/ML, Resource Management

San Jose, California, United States
