Emmanuel Acheampong's Speaker Profile @ Sessionize

Inference in Production: Engineering LLM Serving for Latency, Throughput, and Reliability

Inference looks simple from the outside: send a prompt, get a response. In production, it becomes a systems engineering problem.

Latency spikes under burst traffic. Throughput stalls despite adding GPUs. Tail latency explodes from batching and scheduling dynamics. Teams spend months rediscovering the same bottlenecks around KV cache pressure, autoscaling lag, model warmup, and GPU utilization.

This workshop is presented by Crusoe engineers who work directly on the inference systems powering customer workloads on Crusoe Cloud. In 50 minutes, we’ll break down how modern LLM inference actually works, why production serving is far harder than most teams expect, and the infrastructure patterns required to deliver reliable low-latency inference at scale.

Topics include:

1. The mechanics of inference: tokenization, prefill vs. decode, KV cache behavior, and the real drivers of latency and throughput.

2. Why serving LLMs is difficult in practice: batching tradeoffs, memory pressure, head-of-line blocking, autoscaling behavior, and tail-latency management.

3. How Crusoe engineers its inference stack for low time-to-first-token, sustained throughput, and predictable performance under load.

4. Production case studies in using open-source LLM infrastructure.

Attendees will leave with a systems-level mental model of inference, practical evaluation criteria for inference providers, and concrete operational patterns they can apply to their own deployments.
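
As one illustration of the KV cache pressure mentioned above, the following minimal sketch estimates how much memory the cache needs for a decoder-only transformer. The function name and the model shape are assumptions chosen for illustration, not a description of any Crusoe system. The estimate ignores allocator overhead, paging, and cache quantization.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a decoder-only transformer.

    bytes_per_elem=2 assumes 16-bit (fp16/bf16) cache entries.
    """
    # Factor of 2: one tensor for keys and one for values in every layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**30:.1f} GiB per sequence")  # 1.0 GiB per sequence
```

Because the size grows linearly with both sequence length and the number of concurrent sequences, long contexts and large batches compete for the same GPU memory, which is one reason batching decisions affect latency.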

No Single Model to Rule Them All: Building Resilient AI Agents Across Open & Closed LLMs

The era of betting everything on a single LLM is over. Developers building production AI agents face a reality no model vendor wants to talk about: no one model excels at every task, no single API guarantees 100% uptime, and no proprietary provider offers the cost profile that works for every layer of an agentic pipeline.

The open-source LLM ecosystem has changed the equation. Llama 3.3, DeepSeek-R1, Qwen3, Gemma 3, and Kimi-K2 are not fallback options. They are, for many agentic workloads, the better choice on quality, latency, cost, or all three. But the real power is not in picking one winner. It is in architecting agents that route across multiple models, fail over when an endpoint goes down, and match model strengths to task requirements in real time.

Resilient agentic engineering demands a multi-model, multi-provider architecture — and the neocloud is built for exactly this. Crusoe Managed AI provides a single API surface across every major open-source LLM, on infrastructure purpose-built for the throughput and latency demands of agentic workloads.

This session draws from production experience to walk through the architecture decisions, failure modes, and performance tradeoffs of moving from a single-model prototype to a resilient, multi-model agent in production.
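
The failover idea described above can be sketched as an ordered fallback chain. This is a minimal sketch under stated assumptions: each provider is represented by a plain Python callable, and the names `complete_with_fallback`, `primary_down`, and `backup_ok` are hypothetical stand-ins rather than any real provider SDK.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, completion)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real endpoints.
def primary_down(prompt: str) -> str:
    raise TimeoutError("endpoint unavailable")


def backup_ok(prompt: str) -> str:
    return f"echo: {prompt}"


chain = [("primary", primary_down), ("backup", backup_ok)]
print(complete_with_fallback("hello", chain))  # ('backup', 'echo: hello')
```

A production version would likely add per-provider timeouts, retry budgets, and health checks so that a known-bad endpoint is skipped instead of retried on every request.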

No Single Model to Rule Them All: Building Resilient AI Agents Across Open & Closed LLMs

AI agents are only as reliable as the models behind them. Most teams start by wiring an agent to a single LLM and calling it done. Then reality hits: rate limits, outages, cost spikes, and tasks where one model underperforms another. The teams building resilient agents in production aren't betting on one model. They're building across many.

This talk covers how to architect AI agents that route intelligently across open and closed LLMs. I'll walk through practical patterns for model selection at inference time: when to use a large frontier model versus a fine-tuned open-weight model, how to build fallback chains that maintain agent quality during provider outages, and how to use routing logic to optimize for cost, latency, and task-specific accuracy.

Using PyTorch ecosystem tools like vLLM for self-hosted open models alongside closed API providers, I'll show how teams are deploying agent systems that aren't locked into any single vendor or architecture. We'll look at real tradeoffs between dense and MoE open models for different agent subtasks, and why the most resilient agent architectures treat model selection as a runtime decision, not a design-time one.
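
Treating model selection as a runtime decision can be sketched as a small router that filters a catalog by task and latency budget, then picks the cheapest remaining option. Everything here is an assumption for illustration: the `ModelOption` fields, the model names, and the cost and latency figures are placeholders, not measurements of any real model or provider.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_1k_tokens: float   # placeholder figure, not a real price
    p95_latency_ms: float       # placeholder figure, not a measurement
    tasks: FrozenSet[str]       # subtasks this model is trusted with


def pick_model(task: str, max_latency_ms: float,
               catalog: Iterable[ModelOption]) -> ModelOption:
    """Return the cheapest model that handles `task` within the latency budget."""
    candidates = [
        m for m in catalog
        if task in m.tasks and m.p95_latency_ms <= max_latency_ms
    ]
    if not candidates:
        raise LookupError(f"no model for task={task!r} within {max_latency_ms} ms")
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)


# Hypothetical catalog mixing a closed frontier model and a self-hosted open model.
CATALOG = [
    ModelOption("frontier-large", 10.0, 2000.0, frozenset({"planning", "extraction"})),
    ModelOption("open-small-finetuned", 0.5, 300.0, frozenset({"extraction"})),
]

print(pick_model("extraction", 500.0, CATALOG).name)  # open-small-finetuned
print(pick_model("planning", 5000.0, CATALOG).name)   # frontier-large
```

In practice the catalog values would come from live measurements, so the same request can be routed differently as latency, cost, or availability changes.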

Building Computer Vision AI algorithms for 100 skin shades

roboMUA uses AI to build computer vision models for the beauty and fashion industry that work across more than 100 skin shades.

In this session, I’ll discuss how we gathered our data, trained our models, and deployed models that account for 100 skin shades in order to be inclusive.

Speaker

Emmanuel Acheampong
