NIM+DRA: Running optimized GenAI models on Kubernetes at scale
NVIDIA NIMs are a set of microservices that accelerate the deployment of GPU-optimized generative AI models. They deliver GPU-accelerated language-model inference via CUDA, TensorRT, and the Triton Inference Server.
Each NIM is packaged as a container along with a set of profiles that define which optimized model should be used for a given set of available GPUs. For example, the Llama3 NIM ships two optimized models: one for running on a single H100 and one for running across two A100s. Which profile is selected depends on the hardware available at deployment time, as sketched below.
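To make the profile mechanism concrete, here is a minimal, hypothetical sketch of how a profile table might map detected GPUs to an optimized model build. The profile names, model URIs, and the select_profile helper are illustrative assumptions, not the actual NIM manifest format or selection logic.

```python
# Hypothetical sketch of NIM-style profile selection (not the real manifest format).
# A profile pairs a GPU requirement with the optimized model build to serve.

from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    name: str       # illustrative profile id
    gpu_model: str  # GPU type the build is optimized for
    gpu_count: int  # number of GPUs the model is sharded across
    model_uri: str  # location of the optimized engine/weights (assumed)

# Two illustrative profiles, mirroring the Llama3 example in the abstract.
PROFILES = [
    Profile("llama3-h100x1", "H100", 1, "ngc://nim/llama3/h100-tp1"),
    Profile("llama3-a100x2", "A100", 2, "ngc://nim/llama3/a100-tp2"),
]

def select_profile(available: dict[str, int]) -> Profile | None:
    """Pick the first profile whose GPU requirement the node can satisfy.

    `available` maps GPU model -> count visible to the container,
    e.g. {"A100": 2}. Real NIMs use their own detection and ranking logic.
    """
    for profile in PROFILES:
        if available.get(profile.gpu_model, 0) >= profile.gpu_count:
            return profile
    return None

if __name__ == "__main__":
    print(select_profile({"A100": 2}))  # -> the two-A100 profile
    print(select_profile({"H100": 1}))  # -> the single-H100 profile
```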
We have built an Operator that caches and deploys NVIDIA NIMs at scale, integrating with Kubernetes Dynamic Resource Allocation (DRA) to manage and optimize GPU resources. Combining the NIM Operator with DRA improves GPU scheduling and utilization, enhancing performance, reducing cost, and increasing flexibility. In this talk, we demonstrate the NIM Operator together with DRA to show the resulting improvements in model serving at scale; a rough sketch of the kind of DRA request involved follows below.
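As a rough illustration of the DRA side, the sketch below creates a ResourceClaim asking for a single GPU, which an operator could reference from a NIM pod so the scheduler allocates a device through the DRA driver. The API group/version, the gpu.nvidia.com device class name, and the claim layout are assumptions that vary with the Kubernetes release and DRA driver in use, and this is not the NIM Operator's actual code.

```python
# Rough sketch: creating a DRA ResourceClaim for one GPU via the Kubernetes API.
# Assumes the resource.k8s.io/v1beta1 structured-parameters API and an installed
# NVIDIA DRA driver exposing a "gpu.nvidia.com" DeviceClass; adjust for your cluster.

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a pod

claim = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaim",
    "metadata": {"name": "llama3-gpu-claim", "namespace": "default"},  # illustrative names
    "spec": {
        "devices": {
            "requests": [
                {"name": "gpu", "deviceClassName": "gpu.nvidia.com"}  # assumed device class
            ]
        }
    },
}

# ResourceClaims are served under /apis/resource.k8s.io, so the generic
# custom-objects client can create them even though they are not CRDs.
api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="resource.k8s.io",
    version="v1beta1",
    namespace="default",
    plural="resourceclaims",
    body=claim,
)
```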