Inference in Production: Engineering LLM Serving for Latency, Throughput, and Reliability

Inference looks simple from the outside: send a prompt, get a response. In production, it becomes a systems engineering problem.

Latency spikes under burst traffic. Throughput stalls despite adding GPUs. Tail latency explodes from batching and scheduling dynamics. Teams spend months rediscovering the same bottlenecks around KV cache pressure, autoscaling lag, model warmup, and GPU utilization.

This workshop is presented by Crusoe engineers who work directly on the inference systems powering customer workloads on Crusoe Cloud. In 50 minutes, we’ll break down how modern LLM inference actually works, why production serving is far harder than most teams expect, and the infrastructure patterns required to deliver reliable low-latency inference at scale.

Topics include:

1. The mechanics of inference: tokenization, prefill vs. decode, KV cache behavior, and the real drivers of latency and throughput.

2. Why serving LLMs is difficult in practice: batching tradeoffs, memory pressure, head-of-line blocking, autoscaling behavior, and tail-latency management.

3. How Crusoe engineers its inference stack for low time-to-first-token, sustained throughput, and predictable performance under load.

4. Production case studies in using open-source LLM infrastructure.

Attendees will leave with a systems-level mental model of inference, practical evaluation criteria for inference providers, and concrete operational patterns they can apply to their own deployments.

Emmanuel Acheampong

Senior Manager, Developer Relations at Crusoe

San Francisco, California, United States
