Session
Serve Optimize: Energy Aware Configuration Search for PyTorch LLM Inference
LLM inference is often tuned for maximum throughput or lowest latency, but those settings can waste power or miss better deployment tradeoffs. This talk presents Serve Optimize, an open-source optimizer for serving stacks in the PyTorch ecosystem, starting with vLLM, that searches for energy-aware LLM serving configurations.
Serve Optimize detects the available GPU or MIG slice, generates feasible backend candidates, launches or attaches to OpenAI-compatible inference servers, runs controlled workloads, collects power telemetry, and ranks configurations using throughput, p95 latency, average watts, joules/token, and tokens/watt. It then builds a Pareto frontier and recommends an operating point for a selected goal: maximum throughput, lowest latency, lowest energy per token, best tokens/watt, or balanced performance.
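As a rough illustration of the ranking step described above, the following Python sketch derives the energy metrics from measured values, builds a Pareto frontier, and picks an operating point for a goal. It is not Serve Optimize's actual code: the `Measurement` class, the `GOALS` table, the function names, and the sample numbers are all hypothetical. It also assumes that joules/token is average watts divided by tokens per second, and that tokens/watt is tokens per second divided by average watts. The "balanced performance" goal is left out because the abstract does not define how it is weighted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One benchmarked configuration (hypothetical structure)."""
    name: str
    throughput_tps: float  # generated tokens per second
    p95_latency_s: float   # 95th percentile request latency, seconds
    avg_watts: float       # mean power draw during the run

    @property
    def joules_per_token(self) -> float:
        # watts = joules/second, so (J/s) / (tokens/s) = J/token
        return self.avg_watts / self.throughput_tps

    @property
    def tokens_per_watt(self) -> float:
        # Assumed definition: throughput normalized by average power
        return self.throughput_tps / self.avg_watts


def objectives(m: Measurement) -> tuple:
    # Express every objective as "smaller is better"
    return (-m.throughput_tps, m.p95_latency_s, m.joules_per_token)


def dominates(a: Measurement, b: Measurement) -> bool:
    # a dominates b if it is no worse everywhere and strictly better somewhere
    oa, ob = objectives(a), objectives(b)
    no_worse = all(x <= y for x, y in zip(oa, ob))
    better = any(x < y for x, y in zip(oa, ob))
    return no_worse and better


def pareto_frontier(ms: list) -> list:
    # Keep only configurations that no other configuration dominates
    return [m for m in ms if not any(dominates(o, m) for o in ms)]


# Each goal maps to a key function to minimize (hypothetical goal names)
GOALS = {
    "max_throughput": lambda m: -m.throughput_tps,
    "lowest_latency": lambda m: m.p95_latency_s,
    "lowest_energy_per_token": lambda m: m.joules_per_token,
    "best_tokens_per_watt": lambda m: -m.tokens_per_watt,
}


def recommend(ms: list, goal: str) -> Measurement:
    # Choose the best frontier point for the selected goal
    return min(pareto_frontier(ms), key=GOALS[goal])


# Invented numbers, for illustration only
samples = [
    Measurement("A", throughput_tps=1000, p95_latency_s=2.0, avg_watts=300),
    Measurement("B", throughput_tps=800, p95_latency_s=1.0, avg_watts=200),
    Measurement("C", throughput_tps=700, p95_latency_s=1.5, avg_watts=250),
    Measurement("D", throughput_tps=400, p95_latency_s=0.8, avg_watts=90),
]

frontier_names = [m.name for m in pareto_frontier(samples)]
best_for_throughput = recommend(samples, "max_throughput").name
best_for_energy = recommend(samples, "lowest_energy_per_token").name
```

With these invented samples, configuration C drops off the frontier because B is better on all three objectives. Under the definitions assumed here, joules/token and tokens/watt are reciprocals, so the two energy goals select the same point; a tool that defines tokens/watt differently could rank them differently.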
The session will cover the architecture, measurement methodology, candidate search strategy, and a case study on workstation and MIG-based GPUs. Attendees will leave with a practical pattern for making LLM inference tuning reproducible, measurable, and power-aware rather than relying on benchmark defaults.
Sai Sravan Cherukuri
Open Source Enthusiast and DevSecOps Architect