
Running LLM inference in production with Kubernetes and Kubeflow

Large Language Models (LLMs) are powerful, but deploying them reliably, cost-effectively, and at scale in production is a different challenge altogether. In this session, we’ll walk through how to operationalize LLM inference using Kubeflow on Kubernetes, drawing on open-source, cloud-native tools to build resilient, scalable, and observable GenAI infrastructure.

We'll cover:

Architecture for serving LLMs using KServe or vLLM on GKE (a minimal sketch follows this list)

Strategies for optimizing latency and GPU utilization, and for autoscaling with KEDA, Triton, or Multi-Model Serving

Integrating Kubeflow Pipelines for pre- and post-processing, caching, and batch inference workflows (also sketched below)
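To make the serving topic concrete, here is a minimal sketch (not taken from the session itself) of declaring an InferenceService with the KServe Python SDK, assuming KServe's Hugging Face serving runtime with vLLM as its backend. The namespace, model id, replica bounds, and resource figures are illustrative placeholders.

```python
# Minimal sketch, not from the session: declaring a vLLM-backed
# InferenceService with the KServe Python SDK. The namespace, model id,
# replica bounds, and resource figures are illustrative placeholders.
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1ModelSpec,
    V1beta1ModelFormat,
)

isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(name="llama-demo", namespace="genai"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            min_replicas=1,  # keep one replica warm to avoid cold starts
            max_replicas=4,  # cap GPU spend under bursty load
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name="huggingface"),
                args=[
                    "--model_name=llama-demo",
                    "--model_id=meta-llama/Meta-Llama-3-8B-Instruct",
                    "--backend=vllm",  # Hugging Face runtime delegating to vLLM
                ],
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                    limits={"nvidia.com/gpu": "1"},
                ),
            ),
        )
    ),
)

# Submit to whichever cluster the current kubeconfig points at.
KServeClient().create(isvc)
```

On GKE, the nvidia.com/gpu request is what steers the predictor pods onto a GPU node pool; scaling beyond the built-in replica bounds is typically layered on by pointing a KEDA ScaledObject (or KServe's own scale metric settings) at the predictor deployment.

The pipelines topic can be sketched the same way. Below is a minimal Kubeflow Pipelines v2 definition chaining pre-processing, batch inference, and post-processing; every component body is a placeholder, and the endpoint parameter is assumed to be the URL of an InferenceService like the one above.

```python
# Minimal sketch of a Kubeflow Pipelines v2 definition chaining
# pre-processing, batch inference, and post-processing. All component
# bodies are placeholders.
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11")
def preprocess(raw_uri: str) -> str:
    # Clean and tokenize raw prompts; return the prepared batch location.
    return raw_uri + ".prepared"  # placeholder transformation


@dsl.component(base_image="python:3.11")
def batch_infer(batch_uri: str, endpoint: str) -> str:
    # Send each prompt in the batch to the serving endpoint; store results.
    return batch_uri + ".predictions"  # placeholder result location


@dsl.component(base_image="python:3.11")
def postprocess(predictions_uri: str) -> str:
    # De-tokenize, filter, and cache results for downstream consumers.
    return predictions_uri + ".final"  # placeholder output location


@dsl.pipeline(name="llm-batch-inference")
def llm_batch_inference(raw_uri: str, endpoint: str):
    prepared = preprocess(raw_uri=raw_uri)
    predictions = batch_infer(batch_uri=prepared.output, endpoint=endpoint)
    postprocess(predictions_uri=predictions.output)


if __name__ == "__main__":
    # Compile to a YAML package that can be uploaded to a Kubeflow deployment.
    compiler.Compiler().compile(llm_batch_inference, "llm_batch_inference.yaml")
```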

Chamod Perera

Software Engineer II @ Circles | CNCF Ambassador | GDG Organizer

Colombo, Sri Lanka
