Session
Inference Your LLMs on the Fly: Serverless Cloud Run with GPU Acceleration
This session dives into deploying and running large language models (LLMs) such as Google Gemma and other open-source models in a serverless environment. We'll explore the benefits of using Google Cloud Run with GPU acceleration for efficient and scalable LLM inference.
Discover how to containerize your LLM and deploy it to Cloud Run, leveraging GPUs for faster processing and lower latency. Learn how to optimize your model for inference on Cloud Run, including quantization and batching techniques.
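As a rough sketch of what the containerization step can look like (the session does not prescribe a specific stack; the Ollama server image and the gemma2:9b model tag here are assumptions), a minimal Dockerfile might be:

```dockerfile
FROM ollama/ollama:latest

# Listen on the port Cloud Run routes traffic to, and bake the model
# weights into the image so cold starts don't re-download them.
ENV OLLAMA_HOST=0.0.0.0:8080
ENV OLLAMA_MODELS=/models

# Start the server in the background just long enough to pull Gemma
# weights at build time (model tag is illustrative).
RUN ollama serve & sleep 5 && ollama pull gemma2:9b

ENTRYPOINT ["ollama", "serve"]
```

Deploying with a GPU attached is then a single command. Flag names follow the Cloud Run GPU documentation at the time of writing (GPU support may require the beta track and specific regions); the service name, region, and resource sizes are placeholders:

```bash
gcloud beta run deploy llm-gpu \
  --source . \
  --region us-central1 \
  --gpu 1 --gpu-type nvidia-l4 \
  --cpu 4 --memory 16Gi \
  --no-cpu-throttling \
  --max-instances 1 \
  --no-allow-unauthenticated
```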
Join us to gain practical insights and learn how to seamlessly deploy and scale your LLMs for real-world applications, all while enjoying the cost-effectiveness and ease of management offered by serverless computing.
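To make the quantization and batching ideas above concrete, here is a minimal Python sketch using Hugging Face Transformers with bitsandbytes 4-bit loading. This is one common approach, not necessarily the one covered in the session; the model ID and prompts are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "google/gemma-2-9b-it"  # illustrative; any causal LM on the Hub works

# 4-bit NF4 quantization: weights are stored in 4 bits and dequantized
# on the fly during matmuls, cutting GPU memory use roughly 4x vs. fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.padding_side = "left"  # decoder-only models generate from the right
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Naive static batching: tokenize several prompts together so one
# forward pass serves multiple requests, improving GPU utilization.
prompts = ["Explain serverless computing.", "What is a GPU good for?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

Serving stacks such as Ollama or vLLM handle (continuous) batching for you in production; the manual batch here only illustrates the idea.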
Jochen Kirstätter
The only frontiers are in your mind | GDE Cloud | Microsoft MVP
Port Louis, Mauritius