Harness Engineering for Production LLMs: Eval-as-Code From Day One

An evaluation harness is the difference between an LLM application you can change with confidence and one that becomes a write-only system. This talk shows how to build reusable evaluation harnesses with lm-evaluation-harness, RAGAS, DeepEval, and Promptfoo. It covers eval-as-code in CI, behavioral regression suites, faithfulness scorecards, and the operational practices that keep eval coverage growing as the application grows.
Takeaways: A reusable harness pattern that works across RAG and agentic systems. CI integration patterns for LLM evaluation. Operational practices for keeping evals current as the system evolves.
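
To make "eval-as-code in CI" concrete, here is a minimal sketch of a behavioral regression suite in plain Python. It is an illustration only, not the harness presented in the talk, and it does not use the APIs of lm-evaluation-harness, RAGAS, DeepEval, or Promptfoo. Every name in it (`EvalCase`, `run_suite`, `exit_code`, `fake_model`) is hypothetical, and the model is a stub standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One behavioral expectation: a prompt plus a check on the model output."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_suite(model: Callable[[str], str],
              cases: List[EvalCase],
              threshold: float = 1.0) -> Dict:
    """Run every case against the model and compare the pass rate to a threshold."""
    results = {case.name: bool(case.check(model(case.prompt))) for case in cases}
    pass_rate = sum(results.values()) / len(results) if results else 0.0
    return {"results": results,
            "pass_rate": pass_rate,
            "passed": pass_rate >= threshold}


def exit_code(report: Dict) -> int:
    """Map a report to a process exit code so a CI job can fail on regressions."""
    return 0 if report["passed"] else 1


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call; replace with your client.
    return "Paris" if "France" in prompt else "I don't know"


# The cases live in version control next to the application code.
CASES = [
    EvalCase("capital_france", "What is the capital of France?",
             lambda out: "Paris" in out),
    EvalCase("admits_unknown", "What is the capital of Atlantis?",
             lambda out: "don't know" in out),
]

report = run_suite(fake_model, CASES)
print(report["pass_rate"], exit_code(report))
```

In a CI job, the script would end by passing `exit_code(report)` to `sys.exit`, so that a drop below the threshold fails the build. Keeping the cases as data means each newly observed failure can be added as one more `EvalCase`, which is one way eval coverage can grow with the application.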

Preferred length: 30 min.
Audience: AI engineers, QA engineers, ML practitioners.
Level: Intermediate.
First public delivery: 2026.

Anwar Khan

Production AI Engineering — Agentic AI · MCP · Knowledge RAG · LLM Engineering | Speaker · Author · Mentor

Moline, Illinois, United States
