Session
Running LLM inference in production with Kubernetes and Kubeflow
Large Language Models (LLMs) are powerful, but deploying them reliably, cost-effectively, and at scale in production is a different challenge altogether. In this session, we'll walk through how to operationalize LLM inference using Kubeflow on Kubernetes, leveraging open-source and cloud-native tools to build resilient, scalable, and observable GenAI infrastructure.
We'll cover:
Architecture for serving LLMs using KServe or vLLM on GKE (see the deployment sketch after this list)
Strategies for optimizing latency, GPU usage, and autoscaling with KEDA, Triton, or Multi-Model Serving (see the KEDA sketch below)
Integrating Kubeflow Pipelines for pre/post-processing, caching, and batch inference workflows (see the pipeline sketch below)
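
For the first topic, here is a minimal sketch of what deploying an LLM behind KServe could look like, using KServe's Hugging Face serving runtime (which is backed by vLLM for supported models) and the Kubernetes Python client. The model ID, namespace, resource sizing, and service name are illustrative assumptions, not details from the session:

```python
# Sketch: create a KServe InferenceService for an LLM.
# Assumes KServe is installed on the cluster (e.g., a GKE cluster with
# GPU node pools) and that local kubectl credentials are available.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-demo", "namespace": "default"},  # assumed names
    "spec": {
        "predictor": {
            "model": {
                # KServe's huggingface model format serves via vLLM where supported
                "modelFormat": {"name": "huggingface"},
                "args": [
                    "--model_name=llama",
                    "--model_id=meta-llama/Llama-3.1-8B-Instruct",  # assumed model
                ],
                "resources": {
                    "limits": {"cpu": "4", "memory": "24Gi", "nvidia.com/gpu": "1"},
                },
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="default",
    plural="inferenceservices",
    body=inference_service,
)
```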
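For the autoscaling topic, one possible approach is a KEDA ScaledObject that scales the predictor on request backlog via a Prometheus trigger; vLLM exposes a vllm:num_requests_waiting metric that works as a queue-depth signal. The target deployment name, Prometheus address, and threshold below are assumptions for illustration:

```python
# Sketch: scale the LLM predictor on queued requests with KEDA.
from kubernetes import client, config

config.load_kube_config()

scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "llama-demo-autoscale", "namespace": "default"},
    "spec": {
        # Assumed name of the deployment backing the KServe predictor
        "scaleTargetRef": {"name": "llama-demo-predictor"},
        "minReplicaCount": 1,
        "maxReplicaCount": 4,
        "triggers": [
            {
                "type": "prometheus",
                "metadata": {
                    "serverAddress": "http://prometheus.monitoring:9090",  # assumed
                    "query": "sum(vllm:num_requests_waiting)",
                    "threshold": "8",  # scale out when >8 requests are queued
                },
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh",
    version="v1alpha1",
    namespace="default",
    plural="scaledobjects",
    body=scaled_object,
)
```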
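And for the Kubeflow Pipelines topic, a minimal KFP v2 sketch that wires a pre-processing step into a batch-inference step calling the served endpoint. The component bodies are placeholders, and the path and endpoint parameters are assumed names:

```python
# Sketch: a two-step Kubeflow Pipeline (KFP v2 SDK) for batch inference.
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11")
def preprocess(raw_path: str) -> str:
    # Placeholder: clean/tokenize prompts and write them out.
    clean_path = raw_path + ".cleaned"
    return clean_path


@dsl.component(base_image="python:3.11")
def batch_infer(clean_path: str, endpoint: str) -> str:
    # Placeholder: POST each prompt to the serving endpoint, cache results.
    predictions_path = clean_path + ".predictions"
    return predictions_path


@dsl.pipeline(name="llm-batch-inference")
def llm_pipeline(raw_path: str, endpoint: str):
    cleaned = preprocess(raw_path=raw_path)
    batch_infer(clean_path=cleaned.output, endpoint=endpoint)


# Compile to an IR YAML that can be uploaded to a Kubeflow Pipelines instance.
compiler.Compiler().compile(llm_pipeline, "llm_pipeline.yaml")
```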
Chamod Perera
Software Engineer II @ Circles | CNCF Ambassador | GDG Organizer
Colombo, Sri Lanka