Session
Locking the Monster: Strategies to Isolate Resource Big Eaters
Kubernetes containers on the same node may compete for crucial resources such as CPU, memory, network, disk, kernel parameters, and GPU.
We are not defenseless, though: Kubernetes QoS, quota, and GC mechanisms can contain most potential problems.
In some other cases, however, pods can break through the container isolation walls (consciously or unconsciously) and become disruptive neighbors, causing performance degradation or even node failures. Examples include pods that eat up shared kernel resources (pid, fs.inotify) or network resources (tcp_max_tw_buckets), general overconsumption, and so on.
With AI/LLM workloads, GPU contention is another major issue, as is the heavy I/O stress pods generate (gradient aggregation, checkpoint saving, dataset loading).
This talk shares cases of resource-intensive pods and resource contention, then explores mitigations that minimize the impact of disruptive neighbors, improve resource utilization, and prevent node failures.
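As a rough illustration of the QoS mechanism mentioned above, the sketch below classifies a pod into one of the three Kubernetes QoS classes (Guaranteed, Burstable, BestEffort) from its containers' CPU and memory requests and limits. It is a simplified approximation, not the kubelet's implementation: the function name `qos_class` and the dict layout are assumptions made for this example, quantities are compared as plain strings (so "1" and "1000m" would not be treated as equal), and init containers are ignored.

```python
from typing import List


def qos_class(containers: List[dict]) -> str:
    """Approximate a pod's QoS class from its containers' resources.

    Each container is a dict shaped like a pod spec entry (assumed layout):
    {"resources": {"requests": {...}, "limits": {...}}}
    """
    tracked = ("cpu", "memory")
    any_set = False      # True once any CPU/memory request or limit is seen
    guaranteed = True    # stays True only if every container qualifies

    for container in containers:
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        for name in tracked:
            req, lim = requests.get(name), limits.get(name)
            if req is not None or lim is not None:
                any_set = True
            # Guaranteed needs a limit for both CPU and memory; an omitted
            # request defaults to the limit, an explicit one must match it.
            if lim is None or (req is not None and req != lim):
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    pod = [{"resources": {"requests": {"cpu": "250m"}}}]
    print(qos_class(pod))
```

BestEffort pods are the first candidates for eviction under node memory pressure, and Guaranteed pods the last, so setting requests equal to limits on critical workloads is one basic defense against noisy neighbors.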
Peter Pan
Cloud-Native Developer, Open Source Enthusiast
Shanghai, China