Session
Inference Your LLMs on the Fly: Serverless Cloud Run with GPU Acceleration
This session dives into deploying and running large language models (LLMs) such as Google Gemma and other open-source models in a serverless environment. We'll explore the benefits of using Google Cloud Run with GPU acceleration for efficient and scalable LLM inference.
Discover how to containerize your LLM and deploy it to Cloud Run, leveraging GPUs for faster processing and lower latency. Learn how to optimize your model for inference on Cloud Run, including quantization and batching techniques.
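As a rough sketch of what the containerization step can look like (the session does not prescribe a specific stack; the Ollama server image and the gemma2:9b model tag here are assumptions), a minimal Dockerfile might be:

```dockerfile
FROM ollama/ollama:latest

# Listen on the port Cloud Run routes traffic to, and bake the model
# weights into the image so cold starts don't re-download them.
ENV OLLAMA_HOST=0.0.0.0:8080
ENV OLLAMA_MODELS=/models

# Start the server in the background just long enough to pull Gemma
# weights at build time (model tag is illustrative).
RUN ollama serve & sleep 5 && ollama pull gemma2:9b

ENTRYPOINT ["ollama", "serve"]
```

Deploying with a GPU attached is then a single command. Flag names follow the Cloud Run GPU documentation at the time of writing (GPU support may require the beta track and specific regions); the service name, region, and resource sizes are placeholders:

```bash
gcloud beta run deploy llm-gpu \
  --source . \
  --region us-central1 \
  --gpu 1 --gpu-type nvidia-l4 \
  --cpu 4 --memory 16Gi \
  --no-cpu-throttling \
  --max-instances 1 \
  --no-allow-unauthenticated
```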
Join us to gain practical insights and learn how to seamlessly deploy and scale your LLMs for real-world applications, all while enjoying the cost-effectiveness and ease of management offered by serverless computing.
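To make the quantization and batching ideas above concrete, here is a minimal Python sketch using Hugging Face Transformers with bitsandbytes 4-bit loading. This is one common approach, not necessarily the one covered in the session; the model ID and prompts are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "google/gemma-2-9b-it"  # illustrative; any causal LM on the Hub works

# 4-bit NF4 quantization: weights are stored in 4 bits and dequantized
# on the fly during matmuls, cutting GPU memory use roughly 4x vs. fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.padding_side = "left"  # decoder-only models generate from the right
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Naive static batching: tokenize several prompts together so one
# forward pass serves multiple requests, improving GPU utilization.
prompts = ["Explain serverless computing.", "What is a GPU good for?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

Serving stacks such as Ollama or vLLM handle (continuous) batching for you in production; the manual batch here only illustrates the idea.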
Jochen Kirstätter
The only frontiers are in your mind | GDE Cloud | Microsoft MVP
Port Louis, Mauritius