Session

Stop Agents from Lying: Zero-Shot Hallucination Detection and Safety Drift Monitoring

Agents confidently fabricate facts while still passing binary pass/fail tests. Research shows standard metrics miss 65-93% of safety violations (AgentDrift, March 2026). Agents hallucinate amenities that are not in the search results and drift from safe to harmful recommendations across conversation turns.

Zero-shot hallucination detection identifies fabricated facts without training data. Linear Semantic Consistency (LSC, Oct 2025) achieves 84.6% AUROC by probing the model's internal states; it is training-free and works across model families. The VISTA framework provides claim decomposition: breaking responses into atomic statements and verifying each one independently. With 88.4% precision, flagged claims are mostly real hallucinations. You'll learn when to use LSC (batch, low latency), claim decomposition (per-claim granularity), or an LLM judge (real-time guardrails).

Trajectory-level safety monitoring detects behavioral drift. AgentDrift (March 2026) tested 1,200 multi-turn conversations: 65-93% of violations occurred mid-conversation, not at task boundaries. Example: an agent recommends a legal strategy (turn 1) → gray-area optimization (turn 3) → tax evasion (turn 5). Binary metrics see "success"; trajectory analysis sees drift. You'll implement per-turn safety scoring, track degradation, and flag a conversation when its safety score drops by more than 0.3. StepShield (Jan 2026) formalizes step-wise risk scoring.

Real-time guardrails block unsafe outputs using lifecycle hooks. When an output fails its thresholds, it is replaced with a safe fallback. Latency overhead: 120ms per turn. D3-Gym (April 2026) shows 87.5% agreement with human annotation when rubrics are explicit.

Walk away with zero-shot detection, per-turn monitoring, real-time blocking, and cloud observability patterns.
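As a rough illustration of the per-turn scoring idea in the abstract, here is a minimal Python sketch. It is not the AgentDrift or StepShield implementation. It assumes each turn already has a safety score in [0, 1] (1.0 being safest) from some external scorer, and that "drops >0.3" means a drop relative to the best score seen so far in the conversation; the abstract does not specify either point. The names `DriftMonitor` and `first_flagged_turn` and the example scores are hypothetical.

```python
from dataclasses import dataclass, field

# From the abstract: flag when safety drops by more than 0.3.
DROP_THRESHOLD = 0.3


@dataclass
class DriftMonitor:
    """Tracks per-turn safety scores in [0, 1]; 1.0 is safest (assumed scale)."""

    threshold: float = DROP_THRESHOLD
    scores: list[float] = field(default_factory=list)

    def record(self, score: float) -> bool:
        """Store one turn's score and return True if drift should be flagged."""
        self.scores.append(score)
        # Assumption: the drop is measured against the best score so far,
        # so a slow slide over several turns is still caught.
        return max(self.scores) - score > self.threshold


def first_flagged_turn(scores, threshold=DROP_THRESHOLD):
    """Return the 1-based turn at which drift is first flagged, or None."""
    monitor = DriftMonitor(threshold=threshold)
    for turn, score in enumerate(scores, start=1):
        if monitor.record(score):
            return turn
    return None


if __name__ == "__main__":
    # Hypothetical scores for a five-turn conversation that degrades gradually.
    print(first_flagged_turn([0.95, 0.90, 0.70, 0.68, 0.40]))
```

Comparing against the running best rather than the previous turn is a design choice: no single step in the example falls by more than 0.3, yet the cumulative slide does, which is the pattern a final pass/fail check would miss.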

Elizabeth Fuentes Leone

Developer Advocate

San Francisco, California, United States
