To Conquer 10x workloads on AI infrastructure

With limited GPU capacity, meeting growing user demand and diverse business scenarios on existing infrastructure poses significant challenges.

Traditional approaches to AI infrastructure resource management struggle to address the practical concerns: 10x tenants, 10x workload scenarios, and 10x models being served simultaneously.

To optimize GPU utilization and build a solid foundation for AI innovation, this talk shares the technical insights below:

1) How to implement isolation and sharing for 10x resource over-commitment across many tenants/users, with `Kueue` / `scheduler-plugin`

2) How to prioritize 10x co-located workloads that compete for resources, and pack them compactly onto the available resources, with `HAMi` & `DRA`

3) Faced with 10x as many models to serve, how to strike a balance between on-demand loading and latency SLOs using `Lingo`, and how `ollama-operator` streamlines the serving of multiple models

Please refer to `Additional resources` for further technical details.

Peter Pan

Cloud-Native Developer, Open Source Enthusiast

Shanghai, China
