Session
Taming GPU Chaos: Practical Kubernetes Policies for ML Workloads
Your Kubernetes cluster was humming along fine until the ML team showed up. Suddenly one training job is eating every GPU in the cluster, inference pods are getting OOMKilled, and nobody knows which team's model is costing what.

This talk focuses on one specific, painful problem: how platform teams can enforce fair, safe resource governance for ML workloads on shared Kubernetes infrastructure without becoming bottlenecks. Using Kyverno as the policy engine, we'll walk through real-world patterns, including:

- automatic resource quota enforcement for GPU requests, so a single runaway training job can't starve production
- namespace-level guardrails that give data scientists self-service deployment within safe boundaries
- labeling and annotation policies that make cost attribution and chargeback actually possible

You'll walk away with a simple ML readiness policy bundle you can apply to your cluster on Monday. No giant platform diagrams. No "just build an internal developer platform" hand-waving. Just practical policy-as-code patterns that solve the most common GPU scheduling headaches.
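For a taste of what such a bundle looks like, here is a minimal sketch of two Kyverno policies in this spirit. The policy names, the 4-GPU per-container cap, and the `team` label key are illustrative assumptions for this abstract, not the speaker's actual bundle:

```yaml
# Sketch only: policy names, the 4-GPU cap, and the `team` label key
# are illustrative assumptions, not the session's actual bundle.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: cap-gpu-per-container   # hypothetical name
spec:
  validationFailureAction: Enforce
  rules:
    - name: limit-nvidia-gpus
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Containers may request at most 4 nvidia.com/gpu."
        pattern:
          spec:
            containers:
              # =() anchors apply the check only when the field exists,
              # so CPU-only pods are unaffected.
              - =(resources):
                  =(limits):
                    =(nvidia.com/gpu): "<=4"
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-labels   # hypothetical name
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Workloads must carry a 'team' label for cost attribution."
        pattern:
          metadata:
            labels:
              team: "?*"   # any non-empty value
```

Swapping `Enforce` for `Audit` makes the same policies report violations without blocking deploys, which is a sensible way to roll guardrails out onto an existing cluster before turning them on for real.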