Speaker

Chen Wang

IBM, Senior Research Scientist

Chappaqua, New York, United States

Chen Wang is a Senior Research Scientist at the IBM T.J. Watson Research Center. Her interests include Kubernetes, container cloud resource management, cloud-native AI and LLM systems, and applying AI to cloud system management. She is an open-source advocate, a Kubernetes and CNCF contributor, and a KubeCon speaker. She obtained an MS and a Ph.D. in Electrical & Computer Engineering from Carnegie Mellon University (CMU).

Area of Expertise

  • Information & Communications Technology

Topics

  • AI
  • LLMs
  • Kubernetes
  • Sustainability
  • Model Serving

Climatik: Cloud Native Sustainable LLM via Power Capping

As GenAI workloads grow, demand for advanced accelerators with ever-higher power consumption is surging: NVIDIA GPU peak power has risen from 300W for the V100 to 1000W for the B100. Current power infrastructure and cooling systems were not designed for such rapid increases, leading to challenges such as limited accelerator deployment in some regions and overheating risks that can create fire hazards. We propose Climatik, a dynamic power capping system that lets data center and cluster admins, as well as developers, set power caps dynamically at the cluster, service, namespace, and rack levels. Climatik leverages Kepler for observability and offers APIs that integrate with Kubernetes control knobs, including autoscalers, schedulers, and queuing systems, to ensure power caps are maintained at every level. We will demo how to use Climatik to configure power capping for a large language model (LLM) inference service on KServe and show how power capping influences KEDA autoscaling.
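
To make the power-capping loop concrete, here is a minimal sketch of the idea, not Climatik's actual API: it reads namespace-level power from Kepler metrics scraped into Prometheus and sheds inference replicas when a cap is exceeded. The metric and label names, cap value, and deployment names are illustrative assumptions.

```python
# Minimal sketch of a power-capping control loop (illustrative only; not
# Climatik's actual API). Assumes Kepler metrics are scraped by Prometheus
# and that the inference service is a plain Deployment scaled through the
# Kubernetes API. Metric, label, and object names are hypothetical.
import requests
from kubernetes import client, config

PROM_URL = "http://prometheus:9090/api/v1/query"  # assumed in-cluster address
POWER_CAP_WATTS = 2000.0                          # example namespace-level cap
NAMESPACE, DEPLOYMENT = "llm-serving", "vllm-inference"  # hypothetical names

def namespace_power_watts() -> float:
    # Kepler exposes per-container energy counters; rate() turns joules/s into watts.
    query = (f'sum(rate(kepler_container_joules_total'
             f'{{container_namespace="{NAMESPACE}"}}[1m]))')
    resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def enforce_cap() -> None:
    config.load_incluster_config()
    apps = client.AppsV1Api()
    if namespace_power_watts() > POWER_CAP_WATTS:
        dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
        replicas = max(1, dep.spec.replicas - 1)  # shed one replica at a time
        apps.patch_namespaced_deployment(
            DEPLOYMENT, NAMESPACE, {"spec": {"replicas": replicas}}
        )

if __name__ == "__main__":
    enforce_cap()
```

In the talk itself, this role is played by Climatik's integrations with KEDA and the scheduler rather than a hand-rolled loop.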

Best practices for LLM serving with DRA

In the rapidly evolving landscape of Large Language Models (LLMs), overcoming low GPU cluster utilization (as low as 20-30% in traditional setups) is crucial for serving these models efficiently in Kubernetes. This talk shares insights from deploying and serving LLMs with MIG partitions and Dynamic Resource Allocation (DRA). Our experiments found that the optimal MIG partition size depends on the specific LLM and its load, highlighting both the necessity and the feasibility of using DRA to scale model-serving instances vertically.

We'll showcase deploying the open-source vLLM framework in Kubernetes, focusing on scaling vLLM instances for increased loads while maximizing GPU utilization. Attendees will gain practical knowledge on selecting effective MIG partitions for different models and using DRA to optimize their model-serving systems.
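
As a rough illustration of the partition-sizing question the talk addresses, the sketch below picks the smallest A100-40GB MIG profile that fits a model's estimated footprint. The profile names are real MIG profiles; the memory-estimation formula and example model size are deliberately crude assumptions (real sizing must also account for activation memory, KV-cache growth with batch size, and the serving framework's overhead).

```python
# Toy heuristic for picking the smallest A100-40GB MIG profile that fits a
# model. Illustrative only; measure real footprints before choosing a slice.

# Real A100-40GB MIG profiles and their memory in GiB.
MIG_PROFILES = {"1g.5gb": 5, "2g.10gb": 10, "3g.20gb": 20,
                "4g.20gb": 20, "7g.40gb": 40}

def estimated_memory_gib(params_billion: float, bytes_per_param: int = 2,
                         kv_cache_gib: float = 2.0) -> float:
    """Rough footprint: fp16 weights plus an assumed KV-cache budget."""
    weights_gib = params_billion * bytes_per_param  # ~1 GiB per billion params per byte
    return weights_gib + kv_cache_gib

def pick_profile(params_billion: float) -> str | None:
    need = estimated_memory_gib(params_billion)
    for profile, mem in sorted(MIG_PROFILES.items(), key=lambda kv: kv[1]):
        if mem >= need:
            return profile
    return None  # model does not fit a single MIG slice

# Example: a 7B model in fp16 (~14 GiB weights + cache) lands on 3g.20gb.
print(pick_profile(7))
```

DRA then makes it possible to re-slice as the model mix and load change, instead of fixing partitions at node setup time.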

Trimaran: Load-Aware Scheduling for Power Efficiency and Performance Stability

If some nodes in your cluster are stubbornly congested while others are not, if node utilization is spiky, or if some pods can burst freely while others cannot, you may need the Trimaran scheduler. In this talk, we provide an overview of the Trimaran scheduler plugins and demonstrate their utility. Trimaran plugins are load-aware schedulers that place pods on nodes based on actual measured node resource utilization, while also considering the requests and limits specified for resources. Treating utilization as an objective helps (1) minimize power consumption by targeting an optimal utilization range, (2) avoid congestion and interference among containers running on the same node, and (3) lower the risk of over-commitment when containers burst up to their specified limits.
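
For intuition, here is a simplified sketch of the target-load-packing scoring shape behind Trimaran: a node's score peaks when its predicted utilization hits the target and falls off on either side. The actual plugins in kubernetes-sigs/scheduler-plugins add request-based load estimates, variance handling, and risk models; the 40% target here is an illustrative assumption.

```python
# Simplified sketch of Trimaran-style load-aware scoring (shape only).
# Score peaks when a node's predicted utilization equals the target and
# falls off on either side, steering pods toward an optimal range.

MAX_SCORE = 100

def target_load_packing_score(predicted_util_pct: float,
                              target_pct: float = 40.0) -> float:
    """Score a node given its predicted utilization after placing the pod."""
    u, x = predicted_util_pct, target_pct
    if u <= x:
        # Rise linearly from `x` at 0% utilization to 100 at the target.
        return (MAX_SCORE - x) * u / x + x
    if u <= 100:
        # Fall linearly from 100 at the target to 0 at full utilization.
        return x * (100 - u) / (100 - x)
    return 0.0  # over-committed node

# Nodes near the 40% target win; hot nodes are penalized.
for util in (10, 40, 70, 95):
    print(util, round(target_load_packing_score(util), 1))
```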

Cloud Native Sustainable LLM Inference in Action

Join our tutorial on sustainable Large Language Model (LLM) inference using cloud-native technology. We cover LLMs, their energy use, and Kepler's role in monitoring power during LLM workloads. You will learn how to balance environmental sustainability with efficiency by adjusting AI accelerator frequencies in cloud-native environments, keeping LLM inference both power-efficient and cost-effective.

Experience a live demo of vLLM, an advanced inference framework, in action, and see how we tune AI accelerator settings in a Kubernetes cluster to strike an ideal balance between power and computation.
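
A minimal sketch of the frequency-adjustment idea from the demo, assuming an NVIDIA GPU and root privileges: it steps the locked clock ceiling down via nvidia-smi until measured power fits a budget. The power budget, clock steps, and the minimum-clock floor are illustrative assumptions.

```python
# Minimal sketch of GPU frequency capping for power efficiency. Uses real
# nvidia-smi flags (-lgc locks GPU clocks; --query-gpu reads power draw);
# the budget, clock steps, and 210 MHz floor are illustrative assumptions.
import subprocess

POWER_BUDGET_WATTS = 250.0                 # assumed per-GPU power budget
CLOCK_STEPS_MHZ = [1410, 1200, 990, 810]   # example clock ceilings to try

def gpu_power_watts(gpu_index: int = 0) -> float:
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def cap_clocks(gpu_index: int = 0) -> None:
    # Step the locked clock ceiling down until power fits the budget.
    for max_clock in CLOCK_STEPS_MHZ:
        subprocess.run(
            ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"210,{max_clock}"],
            check=True,
        )
        if gpu_power_watts(gpu_index) <= POWER_BUDGET_WATTS:
            break

if __name__ == "__main__":
    cap_clocks()
```

Lower clocks trade some throughput for a better tokens-per-joule ratio, which is the power-computation balance the demo explores.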

This tutorial is a must-attend for professionals keen on integrating environmental sustainability with cloud-native technology solutions. Whether you're a developer, an IT specialist, or a sustainability advocate, you'll gain valuable insights into the future of eco-friendly cloud computing. Join us to be at the forefront of this significant technological evolution.

KubeCon + CloudNativeCon North America 2024 Sessionize Event

November 2024 Salt Lake City, Utah, United States

KubeCon + CloudNativeCon Europe 2024 Sessionize Event

March 2024 Paris, France

CNCF-hosted Co-located Events Europe 2024 Sessionize Event

March 2024 Paris, France
