Session
Taming GPU Chaos: Practical Kubernetes Policies for ML Workloads
Your Kubernetes cluster was humming along fine until the ML team showed up. Suddenly one training job is eating every GPU in the cluster, inference pods are getting OOMKilled, and nobody knows which team's model is costing what.

This talk focuses on one specific, painful problem: how platform teams can enforce fair, safe resource governance for ML workloads on shared Kubernetes infrastructure without becoming bottlenecks. Using Kyverno as the policy engine, we'll walk through real-world patterns, including:

- automatic resource quota enforcement for GPU requests, so a single runaway training job can't starve production
- namespace-level guardrails that give data scientists self-service deployment within safe boundaries
- labeling and annotation policies that make cost attribution and chargeback actually possible

You'll walk away with a simple ML readiness policy bundle you can apply to your cluster on Monday. No giant platform diagrams. No "just build an internal developer platform" hand-waving. Just practical policy-as-code patterns that solve the most common GPU scheduling headaches.
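For a taste of what such a bundle looks like, here is a minimal sketch of two Kyverno policies in this spirit. The policy names, the 4-GPU per-container cap, and the `team` label key are illustrative assumptions for this abstract, not the speaker's actual bundle:

```yaml
# Sketch only: policy names, the 4-GPU cap, and the `team` label key
# are illustrative assumptions, not the session's actual bundle.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: cap-gpu-per-container   # hypothetical name
spec:
  validationFailureAction: Enforce
  rules:
    - name: limit-nvidia-gpus
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Containers may request at most 4 nvidia.com/gpu."
        pattern:
          spec:
            containers:
              # =() anchors apply the check only when the field exists,
              # so CPU-only pods are unaffected.
              - =(resources):
                  =(limits):
                    =(nvidia.com/gpu): "<=4"
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-labels   # hypothetical name
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Workloads must carry a 'team' label for cost attribution."
        pattern:
          metadata:
            labels:
              team: "?*"   # any non-empty value
```

Swapping `Enforce` for `Audit` makes the same policies report violations without blocking deploys, which is a sensible way to roll guardrails out onto an existing cluster before turning them on for real.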