Session
Harness Engineering for Production LLMs: Eval-as-Code From Day One
An evaluation harness is the difference between an LLM application you can change with confidence and one that becomes a write-only system. This talk shows how to build reusable evaluation harnesses with lm-evaluation-harness, RAGAS, DeepEval, and Promptfoo. It covers eval-as-code in CI, behavioral regression suites, faithfulness scorecards, and the operational practices that keep eval coverage growing as the application grows.
Takeaways: A reusable harness pattern that works across RAG and agentic systems. CI integration patterns for LLM evaluation. Operational practices for keeping evals current as the system evolves.
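As a rough illustration of what "eval-as-code in CI" can mean, here is a minimal sketch of a behavioral regression suite with a pass-rate gate. It is not taken from the talk and does not use any of the named tools. `EvalCase`, `run_suite`, `gate`, `fake_model`, and the example cases are all hypothetical names and data invented for this sketch, and substring matching stands in for whatever scoring a real harness would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One behavioral check: a prompt plus substrings the answer must contain."""
    name: str
    prompt: str
    must_contain: List[str]


def run_suite(model: Callable[[str], str], cases: List[EvalCase]) -> Dict:
    """Run every case against the model and report per-case results and pass rate."""
    results = {}
    for case in cases:
        answer = model(case.prompt).lower()
        # A case passes only if every required substring appears in the answer.
        results[case.name] = all(s.lower() in answer for s in case.must_contain)
    pass_rate = sum(results.values()) / len(results) if results else 0.0
    return {"results": results, "pass_rate": pass_rate}


def gate(report: Dict, min_pass_rate: float) -> bool:
    """Return True when the suite meets the threshold a CI job would enforce."""
    return report["pass_rate"] >= min_pass_rate


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for the real application call (assumption)."""
    canned = {"What is the refund window?": "Refunds are accepted within 30 days."}
    return canned.get(prompt, "I don't know.")


# Hypothetical regression cases, versioned alongside the application code.
CASES = [
    EvalCase("refund_window", "What is the refund window?", ["30 days"]),
    EvalCase("unknown_is_admitted", "What is the CEO's shoe size?", ["don't know"]),
]

report = run_suite(fake_model, CASES)
```

In a CI job, a script like this would typically exit non-zero when `gate(report, threshold)` returns False, so a behavioral regression blocks the merge the same way a failing unit test would.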
Preferred length: 30 min.
Audience: AI engineers, QA engineers, ML practitioners.
Level: Intermediate.
First public delivery: 2026.
Anwar Khan
Production AI Engineering — Agentic AI · MCP · Knowledge RAG · LLM Engineering | Speaker · Author · Mentor
Moline, Illinois, United States