Session

More than Model Sharding: LWS & Distributed Inference

Large LLM like Llama3.1-405B or Deepseek-V3 (671B), require distributed inference across multiple-nodes like vLLM + Ray backend.
However, it's more than just model-slicing with tensor-parallelism, Native K8S treats those workloads across nodes irrelevantly , so challenges come:
- standalone statefulSets without coordination
- demand of Gang-scheduling
- uncontrolled startup order among master & workers, causing boot lag
- HPA as a whole instead of for each sts, to scale together for both Ray head/worker.
- stable index and rank
- topology aware grouping
- failure recovery for vllm/pytorch(not smart enough), to avoid one pod/GPU failure disrupting overall inference

----
So LWS - LeaderWorkerSet (github.com/kubernetes-sigs/lws) , is designed to address them:
- to optimize resource coordination with leader-worker set
- improve performance thru co-location
- integrate scaling with HPA for whole lws together
- all-or-nothing restart policy to fault tolerance as a group.

Nicole Li

Cloud Native Developer at DaoCloud

Shanghai, China

Actions

Please note that Sessionize is not responsible for the accuracy or validity of the data provided by speakers. If you suspect this profile to be fake or spam, please let us know.

Jump to top