Session
Your Agent Lies and Passes Every Test
Your agent returns confident answers. Half are fabricated, and your test suite says PASS. Research shows standard metrics miss 65 to 93% of safety violations (AgentDrift, March 2026): agents invent amenities that were never in the search results and drift from safe to harmful advice across turns. Binary pass/fail sees "task completed" and misses the lie.

Zero-shot hallucination detection finds fabricated facts with no training data. Linear Semantic Consistency (Oct 2025) hits 84.6% AUROC by probing the model's internal states, and it is training-free across model families. Claim decomposition verifies atomic statements at 88.4% precision. You'll learn when to use each versus a real-time LLM judge.

Trajectory monitoring catches behavioral drift, where an agent slides from legal strategy to gray-area optimization to tax evasion across turns. You'll add per-turn safety scoring that flags drops over 0.3, plus real-time guardrails using lifecycle hooks that swap unsafe output for a safe fallback at 120ms.

You'll walk away with:
• Zero-shot detection that needs no labeled training data
• Per-turn monitoring that catches drift before harm
• Real-time guardrails that block unsafe output before delivery
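
As a rough illustration of the per-turn scoring idea, here is a minimal Python sketch. It assumes each turn already has a safety score in [0, 1] from some scorer (not shown), and it reads "drops over 0.3" as a drop relative to the previous turn, which is an assumption rather than something the abstract specifies. The names `flag_safety_drops`, `guard_output`, `DriftFlag`, and the fallback text are hypothetical and are not taken from the session or from any library.

```python
from dataclasses import dataclass
from typing import List, Optional

DROP_THRESHOLD = 0.3  # the abstract's figure: flag drops over 0.3
SAFE_FALLBACK = "I can't help with that request."  # hypothetical fallback text


@dataclass
class DriftFlag:
    turn: int        # zero-based index of the turn where the drop landed
    previous: float  # safety score of the preceding turn
    current: float   # safety score of the flagged turn


def flag_safety_drops(scores: List[float],
                      threshold: float = DROP_THRESHOLD) -> List[DriftFlag]:
    """Return one flag per turn whose score fell by more than `threshold`
    compared with the turn before it (assumed interpretation)."""
    flags = []
    for i in range(1, len(scores)):
        if scores[i - 1] - scores[i] > threshold:
            flags.append(DriftFlag(i, scores[i - 1], scores[i]))
    return flags


def guard_output(output: str,
                 previous_score: Optional[float],
                 current_score: float,
                 threshold: float = DROP_THRESHOLD,
                 fallback: str = SAFE_FALLBACK) -> str:
    """Swap the output for a fallback when the score drop exceeds the threshold.
    The first turn has no previous score, so it passes through unchanged."""
    if previous_score is not None and previous_score - current_score > threshold:
        return fallback
    return output


if __name__ == "__main__":
    scores = [0.9, 0.85, 0.5, 0.45]  # made-up scores for illustration only
    for flag in flag_safety_drops(scores):
        print(f"turn {flag.turn}: {flag.previous} -> {flag.current}")
```

Comparing only adjacent turns would miss a slow slide made of many small drops; comparing against the highest score seen so far is one possible alternative. In a real agent, `guard_output` would be called from whatever lifecycle hook the framework exposes before a response is delivered.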
Elizabeth Fuentes Leone
Developer Advocate
San Francisco, California, United States