Session
Energy Aware LLM Serving: Open Source Configuration Search for Faster, Cheaper, and Green Inference
Open source LLM serving has become fast, but not always efficient. Teams often tune vLLM, SGLang, TensorRT-LLM, and OpenAI-compatible servers for maximum tokens/sec, then discover later that the chosen configuration wastes power, raises cost, or misses a better deployment tradeoff.
This session presents Serve Optimize, an open source approach to energy-aware LLM inference tuning. It detects the available GPU or MIG slice, generates feasible serving candidates, runs controlled workloads, collects power telemetry, and compares configurations using tokens/sec, p95 latency, average watts, joules/token, and tokens/watt. Instead of producing another benchmark table, it builds a Pareto frontier and recommends an operating point for the user’s goal: maximum throughput, lowest latency, lowest energy per token, best tokens/watt, or balanced performance.
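As a rough illustration of the comparison step, the sketch below derives joules/token from average watts and tokens/sec, filters benchmark results down to a Pareto frontier, and picks a point for a stated goal. It is not Serve Optimize's actual code: the `Result` class, the function names, the goal labels, and the sample numbers are all hypothetical, and the three objectives chosen here (throughput, p95 latency, joules/token) are an assumption about what the frontier is built over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One benchmarked configuration (hypothetical structure)."""
    name: str
    tokens_per_sec: float
    p95_latency_ms: float
    avg_watts: float

    @property
    def joules_per_token(self) -> float:
        # A watt is a joule per second, so W / (tokens/sec) = joules/token.
        return self.avg_watts / self.tokens_per_sec


def objectives(r: Result) -> tuple:
    # Express every objective as "lower is better" (throughput is negated).
    return (-r.tokens_per_sec, r.p95_latency_ms, r.joules_per_token)


def dominates(a: Result, b: Result) -> bool:
    # a dominates b if it is no worse everywhere and strictly better somewhere.
    pairs = list(zip(objectives(a), objectives(b)))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def pareto_frontier(results: list) -> list:
    # Keep only configurations that no other configuration dominates.
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Hypothetical goal labels mapped to the metric each one minimizes.
GOALS = {
    "max_throughput": lambda r: -r.tokens_per_sec,
    "lowest_latency": lambda r: r.p95_latency_ms,
    "lowest_energy_per_token": lambda r: r.joules_per_token,
}


def recommend(results: list, goal: str) -> Result:
    # Choose the frontier point that is best for the requested goal.
    return min(pareto_frontier(results), key=GOALS[goal])


# Made-up numbers for illustration only, not measurements.
results = [
    Result("A", tokens_per_sec=1000, p95_latency_ms=800, avg_watts=300),
    Result("B", tokens_per_sec=1500, p95_latency_ms=1200, avg_watts=400),
    Result("C", tokens_per_sec=900, p95_latency_ms=900, avg_watts=320),
    Result("D", tokens_per_sec=600, p95_latency_ms=400, avg_watts=150),
]

frontier_names = [r.name for r in pareto_frontier(results)]
```

With these sample values, configuration C drops out because A beats it on all three objectives, while the fastest configuration (B) and the most energy-efficient one (D) both stay on the frontier. If tokens/watt is read as throughput per watt, it is the reciprocal of joules/token, so the two goals would pick the same point. A "balanced" goal would need a weighting scheme that this sketch does not attempt.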
The talk will focus on the reusable open source design pattern: hardware detection, candidate pruning, benchmark orchestration, telemetry, reproducible artifacts, and goal-aware recommendations. Attendees will leave with a practical method for making self-hosted AI inference measurable, repeatable, and power-aware.
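One way to picture the candidate pruning step is a memory-based feasibility filter applied before any benchmark runs. The sketch below is an assumption about how such pruning could work, not a description of the talk's implementation: the `Model` and `Candidate` classes, the knob names, the search grid, and the 90% headroom factor are hypothetical, and the memory estimate counts only weights and KV cache while ignoring activations and runtime overhead.

```python
from dataclasses import dataclass
from itertools import product

GIB = 1024 ** 3


@dataclass(frozen=True)
class Model:
    """Rough model shape (hypothetical structure)."""
    params: float
    layers: int
    kv_heads: int
    head_dim: int


@dataclass(frozen=True)
class Candidate:
    """One serving configuration to benchmark (hypothetical knobs)."""
    tensor_parallel: int
    max_seqs: int
    max_len: int
    bytes_per_value: int


def estimated_bytes_per_gpu(m: Model, c: Candidate) -> float:
    # Weights: one value per parameter.
    weights = m.params * c.bytes_per_value
    # KV cache: a key and a value vector per layer, per KV head, per token.
    kv_per_token = 2 * m.layers * m.kv_heads * m.head_dim * c.bytes_per_value
    kv_cache = kv_per_token * c.max_seqs * c.max_len
    # Simplifying assumption: tensor parallelism splits both evenly.
    return (weights + kv_cache) / c.tensor_parallel


def feasible_candidates(m: Model, gpu_count: int, gpu_mem_gib: float,
                        headroom: float = 0.9) -> list:
    # Leave some memory unused for everything the estimate ignores.
    budget = gpu_mem_gib * GIB * headroom
    grid = product([1, 2, 4, 8], [16, 64, 256], [4096], [2])
    kept = []
    for tp, seqs, length, nbytes in grid:
        c = Candidate(tp, seqs, length, nbytes)
        # Prune configs that need more GPUs or memory than detected.
        if tp <= gpu_count and estimated_bytes_per_gpu(m, c) <= budget:
            kept.append(c)
    return kept


# Illustrative shape for an 8B-parameter model; not tied to a specific release.
model = Model(params=8e9, layers=32, kv_heads=8, head_dim=128)
single_gpu = feasible_candidates(model, gpu_count=1, gpu_mem_gib=40)
dual_gpu = feasible_candidates(model, gpu_count=2, gpu_mem_gib=40)
```

Filtering on an estimate like this keeps the benchmark phase from spending time on configurations that would fail at startup, at the cost of occasionally pruning a configuration that would have fit.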
Sai Sravan Cherukuri
Open Source Enthusiast and DevSecOps Architect