Session

Sit back and relax with fault awareness and robust instant recovery for large scale AI workloads

Fault tolerance during training, fine-tuning, and even inference is crucial for modern AI workloads that run at large scale across big GPU clusters.

For training and fine-tuning tasks, GPU failures, storage failures, and other hardware issues can significantly extend training time, sometimes to weeks or even months. For inference, when massive volumes of requests are coming in and one of the inference servers becomes faulty, we need a policy and a scheduler that mitigate the fault by transferring the workload quickly and efficiently.

In this talk, we will introduce a series of mechanisms we have designed to help Kubernetes clusters and the workloads themselves locate faults, diagnose the root cause, and schedule and perform mitigation for hardware or CUDA API call failures, reducing the overall operational burden. The possibilities do not stop there: the fault-awareness and mitigation scheduler can help any kind of workload recover during failures.
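
To make the idea concrete, here is a minimal sketch (not the speakers' actual implementation) of one common mitigation step on Kubernetes: when a fault is detected on a GPU node, taint it so the scheduler stops placing new workloads there and existing pods can be rescheduled elsewhere. The taint key `gpu.fault/detected` and the node name are hypothetical, chosen only for illustration; the sketch assumes standard client-go APIs and in-cluster credentials.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// taintFaultyNode adds a NoSchedule taint to a node reported as faulty,
// so the Kubernetes scheduler stops placing new GPU workloads on it.
// Eviction/rescheduling of existing pods would be handled by a separate
// mitigation controller.
func taintFaultyNode(ctx context.Context, client kubernetes.Interface, nodeName, reason string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("get node %s: %w", nodeName, err)
	}
	taint := corev1.Taint{
		Key:    "gpu.fault/detected", // hypothetical taint key for illustration
		Value:  reason,
		Effect: corev1.TaintEffectNoSchedule,
	}
	// Avoid stacking duplicate taints on repeated fault reports.
	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key {
			return nil
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, taint)
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// "gpu-node-3" and "XidError" are placeholder values for this sketch.
	if err := taintFaultyNode(context.Background(), client, "gpu-node-3", "XidError"); err != nil {
		panic(err)
	}
}
```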

Kebe Liu

DaoCloud, Senior software engineer

Shanghai, China
