Session

LLM Evals: the new CI/CD for GenAI products

Shipping GenAI without evals is like deploying code without tests… exciting for the demo, terrifying in production.

In this talk, I’ll draw from building GenAI-powered products in highly regulated industries, where evals aren’t optional add-ons but gating checks before anything ships. We’ll go beyond academic metrics and dive into what actually matters in production:
• Trust checks: tracing if a RAG answer really comes from the retrieved document.
• Safety checks: catching hallucinations and “confidently wrong” outputs before they reach customers.
• Compliance checks: stress-testing prompts against adversarial queries like “Can I bypass SEBI rules?” or “How do I insider trade?”
• Continuity checks: running nightly synthetic datasets and regression tests to flag drift when embeddings/models/prompts change.
• Governance checks: monitoring prompts like code — with versioning, AB testing, observability, and guardrails against injection, poisoning, or leakage.

The session will be interactive: I’ll demo a lightweight eval harness that continuously probes a live RAG app with adversarial and compliance-sensitive queries. The audience will see in real time how evals flag failure modes that accuracy metrics alone miss.

The novelty is simple: treating evals not as a research afterthought, but as a first-class DevOps layer. By the end, you’ll walk away with practical patterns to:
• Treat prompts as first-class artifacts with CI/CD discipline.
• Embed guardrails and governance hooks alongside evals.
• Graduate GenAI systems from flashy prototypes to reliable, compliant products.

Indranil Chandra

Architect ML & Data Engineer @ Upstox

Mumbai, India

Actions

Please note that Sessionize is not responsible for the accuracy or validity of the data provided by speakers. If you suspect this profile to be fake or spam, please let us know.

Jump to top