Elizabeth Fuentes Leone
Developer Advocate
San Francisco, California, United States
As a Developer Advocate, Elizabeth helps developers build production-ready AI applications. With a background spanning data analytics, machine learning, and developer education, she specializes in making complex AI concepts accessible through hands-on tutorials, open-source projects, and live demos.
She creates practical resources for RAG systems, agentic workflows, and multimodal applications, focusing on code that developers can deploy immediately. As a conference speaker and workshop instructor, she bridges the gap between cutting-edge AI research and real-world implementation.
In my role as a Data Analytics and Machine Learning/Artificial Intelligence (ML/AI) Specialist, my mission is to simplify complex concepts by translating them into language that is accessible to everyone. I build innovative solutions that effectively tackle real-world challenges. By speaking at conferences and creating educational resources, I share my knowledge and experience to empower developers, helping them expand their skills and reach their professional goals.
Area of Expertise
Topics
Why AI Agents Forget Everything
Your AI agent helps a user pick a premium option. Next interaction it suggests the cheapest alternative. It forgot everything: preferences, history, instructions. Every conversation starts from zero, so users repeat themselves and abandon the product. Three types of memory loss affect every agent. Memory decay: tools never store what they learn, so preferences vanish between turns. No structured profile: even with state, the agent has no way to build or query one. Memory overload: as memory grows to dozens of sections, dumping everything wastes tokens. This talk covers three progressive fixes that build on each other: persistent state with agent.state and FileSessionManager so preferences survive sessions, the Core Memory Pattern (MIRIX, MemGPT) that gives the agent tools to manage its own memory, and semantic retrieval that loads only relevant sections per query for 60-98% fewer tokens. Each comes with a live demo and real metrics. You'll leave with a decision framework for which pattern fits which use case, plus open-source code for any domain.
Outline: • Why Agents Forget • Fix 1: Persistent State • Fix 2: Core Memory Pattern • Fix 3: Semantic Memory Retrieval • Decision Framework + Resources
Your Agent Works on Localhost. Now Ship It
Your anti-hallucination demos look impressive on your laptop. GraphRAG returns precise answers, guardrails block invalid operations, validation catches fabricated data. Then you ship, and the notebook falls apart: hardcoded keys, in-memory data, no observability, a custom FAISS index nobody wants to maintain. The hard part of agent reliability is not the technique. It is making the technique survive production. This talk shows how 5 anti-hallucination techniques translate from prototype to production: semantic tool routing via MCP (no custom vector index), database-backed steering rules you change in seconds without redeploying, STEER messages that let agents self-correct instead of hard-failing, and GraphRAG on a managed graph database built from 300 documents. A live demo runs 8 scenarios including hallucination attempts and rule violations. You'll walk away with: • A complete production architecture deployable as infrastructure-as-code • Database-backed steering rules you change in seconds, no redeploy • The STEER message pattern for self-correction instead of hard failure • Open-source code with serverless infrastructure and graph database integration
Outline: • The Prototype-to-Production Gap • Semantic Tool Routing via MCP Gateway • Steering Rules in DynamoDB • GraphRAG in Production • Full Production Test • Resources + Q&A
Your Agent's Context Window Is Full. Now What?
A tool returns 214KB of logs. The context window overflows, reasoning quality drops, and your agent returns worse answers with no error thrown. Another retries the same call fourteen times on ambiguous feedback. A third blocks seventeen seconds on a slow tool and times out. None crash. They just cost money and accuracy as data and complexity grow. This hands-on workshop teaches four context engineering strategies through three fixes you build yourself: • Externalize and Select with a memory pointer: keep large data outside the window and pull it back by reference, in single-agent and multi-agent setups. Seven times fewer tokens, no information loss. • Compress runaway loops with a debounce and tools that return clear states, so the agent knows when to stop. Fourteen calls to two. • Isolate slow tools behind an async handle that returns immediately and polls, so the agent is never frozen. Seventeen seconds to under two. You run the failing version, build the fix, and compare metrics. You leave with code for all three fixes, a decision framework matching each strategy to its failure, and an open-source repo. You build along instead of watching slides, and the patterns work with any agent framework.
Outline: • Introduction: The Infinite Window Is a Myth • The Four Context Engineering Strategies • Module 1: Memory Pointer, Single Agent • Module 2: Memory Pointer, Multi Agent • Module 3: Compress Runaway Loops • Module 4: Isolate Slow Tools • Anti-Patterns, Decision Framework, Resources
Build a Voice Agent That Teaches and Remembers
Most AI-powered learning tools recite one-size-fits-all content. Someone studying business English gets the same conversations as someone preparing for travel. Real learning needs an agent that listens, responds, and adapts to the individual. A real-time voice agent solves this. It runs spoken conversations, catches pronunciation and grammar errors as you speak, corrects gently without breaking the flow, and remembers your progress so the next session picks up where you left off. You will see a working implementation: an English conversation practice agent that runs voice conversations, gives real-time feedback, pulls reference material from uploaded documents, and persists learner progress between sessions. The session covers the decisions that make a teaching voice agent work in production: speech-to-speech model versus a modular speech-to-text and text-to-speech pipeline, prompts that teach rather than recite, catching errors without breaking flow, and holding state across sessions without bloating context. The same pattern applies to technical onboarding, compliance training, or any domain that adapts to the individual.
Efficient Agent Memory Retrieval with Semantic Search
Your AI agent has 8 memory sections about a user: persona, travel and food preferences, work schedule, past trips, loyalty programs, emergency contacts. The user asks "What food do I like and what should I avoid?" The naive approach dumps all 8 into context, so the food preferences end up buried under irrelevant emergency contacts, work schedule, and loyalty miles. Tokens wasted, quality degraded, and worse as memory grows to 20 sections. This talk shows why dump-all wastes 60-98% of tokens, how keyword search improves on it but misses synonyms, and how semantic search uses embedding similarity to find conceptually related memories (top-3 per query). A multi-turn scenario loads different sections per query, backed by Zep (94.8% DMR, 90% less latency), PersonaAgent (+56.1% F1), and HippoRAG 2. You'll leave with working semantic search over core memory using SentenceTransformers, a comparison of dump-all, keyword, and semantic retrieval with real token metrics, and open-source code. Most RAG talks search external documents; this applies the same techniques to what the agent knows about the user.
Outline: • The Memory Overload Problem • Scenario 1: Dump All • Scenario 2: Keyword Search • Scenario 3: Semantic Search Top-3 • Scenario 4: Multi-Turn Retrieval • Decision Framework + Resources
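The top-k retrieval step described above can be sketched without any external library. This is a minimal illustration, not the talk's implementation: `toy_embed` is a hypothetical word-count stand-in for a real embedding model such as SentenceTransformers, so this particular instance matches on shared words only and will miss synonyms. The function names and the sample memory sections are assumptions made for the example.

```python
import math
from collections import Counter


def toy_embed(text):
    """Word-count vector. A lexical stand-in; a real embedding model goes here."""
    return Counter(text.lower().replace("?", "").replace(",", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k_sections(query, memory, k=3, embed=toy_embed):
    """Return the names of the k memory sections most similar to the query."""
    q = embed(query)
    ranked = sorted(memory, key=lambda name: cosine(q, embed(memory[name])), reverse=True)
    return ranked[:k]


# Made-up memory sections for illustration only.
memory = {
    "food": "likes thai food and sushi, should avoid peanuts",
    "work": "meetings monday to thursday, remote on friday",
    "loyalty": "airline miles gold status, hotel points",
    "emergency": "contact maria, phone on file",
}
selected = top_k_sections("What food do I like and what should I avoid?", memory, k=1)
```

Only the sections named in `selected` would be loaded into context. Passing a different `embed` callable is the point where an embedding model would turn this from lexical matching into semantic retrieval.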
The 424 Error Eating Your MCP Agents
Your agent calls an MCP tool that talks to an external API. Fifteen seconds pass. Thirty seconds. Then a cryptic 424 Failed Dependency error kills the whole workflow. The user sees nothing useful, your logs show nothing helpful. MCP tools are black boxes to the calling agent, so one slow API call cascades into full failure, and 300-second hangs burn resources with zero feedback. The async handleId pattern fixes it. start_long_job kicks off the operation and returns a job ID instantly. check_job_status lets the agent poll for results at controlled intervals while staying responsive. Failed or stalled jobs surface as clear error states. The talk builds a complete FastMCP server live, simulating four API behaviors: fast (2s), slow (15s), unresponsive (300s), and failing. You watch real 424 errors happen, then watch the pattern turn 300-second hangs into sub-4-second responses. You'll walk away with: • A FastMCP server implementing the async handleId pattern with job tracking • MCP debugging techniques for finding timeout root causes in production • Production job tracking with status management and cleanup • Client patterns for connecting agents to async MCP tools
Outline: • The 424 Problem • MCP Server Architecture • Async HandleId Pattern • Production Integration • Advanced Patterns and Wrap-Up
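The start-and-poll idea behind the async handleId pattern can be sketched in plain Python. This is a framework-free illustration under stated assumptions, not the FastMCP server from the talk: `start_long_job` and `check_job_status` take their names from the abstract, while the in-memory job registry, the status strings, the `stall_after` parameter, and `slow_api` are hypothetical.

```python
import threading
import time
import uuid

# In-memory job registry (hypothetical; a real server would add cleanup and persistence).
_jobs = {}
_lock = threading.Lock()


def _run(job_id, func, args):
    """Run the slow call in the background and record its outcome."""
    try:
        result = func(*args)
        update = {"status": "SUCCEEDED", "result": result}
    except Exception as exc:  # surface failures as a clear state, not a hang
        update = {"status": "FAILED", "error": str(exc)}
    with _lock:
        _jobs[job_id].update(update)


def start_long_job(func, *args):
    """Kick off the operation and return a job ID immediately."""
    job_id = uuid.uuid4().hex
    with _lock:
        _jobs[job_id] = {"status": "RUNNING", "started": time.monotonic()}
    threading.Thread(target=_run, args=(job_id, func, args), daemon=True).start()
    return job_id


def check_job_status(job_id, stall_after=30.0):
    """Return the current state; report STALLED if the job has run too long."""
    with _lock:
        job = dict(_jobs.get(job_id, {"status": "UNKNOWN"}))
    if job["status"] == "RUNNING" and time.monotonic() - job["started"] > stall_after:
        job["status"] = "STALLED"
    job.pop("started", None)  # internal bookkeeping, not part of the reply
    return job


def slow_api(seconds):
    """Stand-in for an external API call (hypothetical)."""
    time.sleep(seconds)
    return f"done after {seconds}s"
```

The caller gets a job ID back at once and polls `check_job_status` at its own pace, so a slow or failing dependency shows up as an explicit RUNNING, STALLED, or FAILED state instead of a blocked call.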
When Agents Loop: Cutting 14 Tool Calls to 2
Your AI travel booking agent just charged a customer fourteen times for the same flight. It called the booking tool, got a vague response, was not sure if it worked, and tried again. And again. Fourteen calls where two would do, twenty-one seconds wasted, hundreds of thousands of burned tokens. This is the repeated tool call problem, and it is far more common than you think. The root cause is simple. Ambiguous feedback leaves agents uncertain whether an action succeeded, so they retry. Studies show an average of 3.2x overcalling when tools return unclear responses, and it compounds across multi-step workflows. Three complementary solutions address it at different layers. DebounceHook keeps a sliding window of recent calls and blocks duplicates before they execute. Clear SUCCESS and FAILED states redesign responses so the agent knows when to proceed or stop. LimitToolCounts caps the calls to any one tool as a safety net. You'll walk away with: • A DebounceHook implementation with sliding window detection • A structured tool response pattern with explicit completion states • A live demo cutting 14 calls to 2 and 21s to 4s
Outline: • The Token Waste Problem • DebounceHook: Detect and Block Duplicates • Clear SUCCESS/FAILED States: Prevention by Design • LimitToolCounts: Hard Ceiling Enforcement • Production Patterns and Wrap-Up
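A sliding-window duplicate check combined with explicit completion states might look like the sketch below. It is framework-free and only illustrates the idea: the class name `DebounceHook` comes from the abstract, but the `allow` method, the window size, the fingerprint format, and the `book_flight` tool are assumptions, not the talk's actual hook API.

```python
import json
from collections import deque


class DebounceHook:
    """Block a tool call that repeats one seen in the last `window` calls."""

    def __init__(self, window=5):
        self.recent = deque(maxlen=window)  # sliding window of call fingerprints
        self.blocked = 0

    @staticmethod
    def _fingerprint(tool_name, params):
        # Sort keys so the same parameters in a different order still match.
        return tool_name + ":" + json.dumps(params, sort_keys=True)

    def allow(self, tool_name, params):
        """Return True if the call may execute, False if it is a duplicate."""
        key = self._fingerprint(tool_name, params)
        if key in self.recent:
            self.blocked += 1
            return False
        self.recent.append(key)
        return True


def book_flight(flight, hook):
    """Hypothetical tool that returns an explicit completion state."""
    if not hook.allow("book_flight", {"flight": flight}):
        return {"status": "FAILED", "reason": "duplicate call blocked; booking already attempted"}
    return {"status": "SUCCESS", "confirmation": f"CONF-{flight}"}
```

The unambiguous `status` field is what lets the agent decide to proceed or stop. The debounce is the backstop for the cases where it retries anyway.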
Multimodal AI Agents with Long-Term Memory
A user asks your multimodal agent to "compare this video with the one I shared last week," and the agent has no idea what video they mean. You built something intelligent with no memory, and without memory intelligence is just computation. The memory problem is harder than it looks. Text-only agents summarize conversations into strings, but multimodal agents process video frames, image features, audio, and text at once. What should be remembered, the raw video, a description, the embeddings? And how do you retrieve the right memory when a new conversation references something from three sessions ago? Traditional session stores were not designed for this. This talk builds multimodal agents with an open-source agent SDK (similar patterns apply to LangGraph or AutoGen), creates custom video analysis tools, converts them into reusable MCP servers, and adds scalable chat memory with a managed vector store that retrieves memories by semantic search. You'll leave with a multi-agent architecture and patterns for reusable MCP servers you build once and share across your fleet.
Outline: • The Agent That Forgets • Building Multimodal Agents with Strands • Converting Tools to MCP Servers • Scalable Chat Memory with S3 Vectors • The Complete System and Resources
Research Agents That Don't Invent Sources
Your research agent works great in Jupyter. Then it leaks API keys in a stack trace, forgets what it researched two messages ago, and returns three citations, two of which link to pages that do not exist. The demo everyone loved is now a liability. Research agents have production challenges generic deployment guides miss. They call multiple external APIs with different authentication, research iteratively where one query shapes the next so they need context most frameworks discard between turns, and must provide verifiable source attribution because a hallucinated citation destroys trust permanently. This talk shows how credentials leak and how API gateways isolate them while identity management issues per-session credentials that expire. It adds conversation context so the agent builds on its own findings, plus a verification pipeline that checks every citation: the URL exists, the content matches the claim, the source is accessible. You'll leave with a security architecture that keeps credentials out of environment variables, plus patterns for persistent context and verified attribution.
Outline: • The Research Agent That Became a Liability • Securing Credentials with API Gateways • Persistent Conversation Context • Source Verification That Actually Works • The Complete Research Agent and Resources
Ship It: From Agent Demo to Production in Minutes
Your agent demo wowed the team. Six months later you are still rewriting it for production. It forgets users between sessions, you have no idea why it failed at 3 AM, it cannot handle 10 concurrent requests, and last month's bill was four times the estimate. The prototype-to-production gap is where AI projects die. The agent works fine in a notebook. The problem is everything around it: persistent memory, monitoring, infrastructure that scales without manual work, and cost controls. Together these take months when they should take minutes. In this session I take a prototype agent to a production endpoint live: cross-session memory in a managed vector store, zero-code monitoring of every decision and token count, auto-scaling that handles spikes and drops to zero, and cost patterns with per-conversation budgets and caching. You'll walk away with: • A production-readiness deployment checklist • Infrastructure-as-code templates that work with any framework • A cost model that predicts monthly spend before you ship
Outline: • The Six-Month Gap • Cross-Session Memory with S3 Vectors • Zero-Code Monitoring and Observability • Auto-Scaling and Cost Optimization • The Complete Picture and Resources
29 Tools, and Your Agent Picks the Wrong One
Your AI agent has 29 tools. On every call, all 29 descriptions get serialized into the context window, whether the user asks about weather or hotel bookings. That is thousands of wasted tokens per query, and the LLM still picks the wrong tool 15% of the time. Past 10 to 15 tools, the LLM struggles to choose from a crowded context, and every description inflates cost linearly with tool count. Semantic tool selection fixes both. Using FAISS and SentenceTransformers, you embed tool descriptions and search at query time to filter down to the relevant ones before they ever reach the LLM. I'll show three approaches, dynamic tool swapping while preserving memory, and a live comparison of all-tools versus semantic selection on identical queries. You'll walk away with: • A working semantic tool selection implementation with FAISS • A tool registry pattern with embeddings and metadata • Open-source code for a 29-tool travel agent system with measured token and error rates
Outline: • The Dual Problem • Solution Architecture • Live Implementation • Production Pattern • Advanced Patterns
When RAG Hallucinates Numbers: Graph-RAG for Precise Answers
Your RAG agent seems smart, until you ask it to count something. "How many items match criteria X?" Traditional RAG fabricates: "approximately 45-50." The real answer from your data? 133. Vector similarity can't count, aggregate, or reason across relationships. The fundamental limitation: Traditional RAG retrieves text chunks by similarity, then asks the LLM to synthesize answers. This works for simple lookups but fails systematically on four query types: counting ("how many?"), aggregation ("what's the average?"), multi-hop reasoning ("what's available at the highest-rated?"), and out-of-domain detection ("any results in Antarctica?", where RAG fabricates and Graph-RAG correctly says "none"). I will cover why traditional RAG hallucinates on structured queries (the architectural root cause), how Graph-RAG builds knowledge graphs automatically using neo4j-graphrag without manual schema design, the Text2Cypher pattern that converts natural language into precise database queries the LLM cannot fabricate, a side-by-side comparison on identical queries showing RAG fabrication vs Graph-RAG precision, and production implementation patterns with open-source tools. You'll walk away with: • Graph-RAG implementation with Neo4j and auto entity extraction for any document set • Text2Cypher query generation to get precise answers from knowledge graphs • A concrete decision framework for when to use RAG vs Graph-RAG • Hybrid architecture patterns: Graph-RAG for structured queries, RAG for unstructured • Open-source code adaptable to any domain with structured data (product catalogs, FAQs, inventories) Most RAG talks focus on embeddings and retrieval tuning. This addresses RAG's fundamental limitation: statistical hallucinations on structured data. The solution (knowledge graphs + Cypher) is domain-agnostic and applies wherever your documents contain countable, aggregatable, or relationship-rich data.
Outline: • The RAG Hallucination Problem • Graph-RAG Architecture • Live Implementation • Production Patterns • Decision Framework
$47 a Minute, and Your Agent Calls the Wrong API
Your agent works in dev. In production it costs $47 a minute and calls the wrong API. Production agents fail two ways: financially through runaway cost, and functionally through wrong tool usage. A 5 percent accuracy gain at 20 times the cost is not sustainable. Pareto frontier analysis plots quality against cost across models and finds where no alternative is both cheaper and better. Prompt caching cuts multi-turn cost by 90 percent: an agent re-reading a 20K token system prompt every turn wastes most of its tokens, and caching takes a 10-turn conversation from $4.80 to $0.52. Tool correctness is cheaper than you think. Cascading checks catch 83 percent of errors for free with deterministic rules, more in 1 to 10ms with constraint validation, and the rest with one LLM call. Full coverage costs 20 times less than validating everything with an LLM ($0.12 versus $2.40 for 1,200 cases). You'll walk away with: • Per-invocation cost tracking with budget alarms • Pareto analysis to pick the best-value model • Cascading tool validation that catches errors at 20x lower cost
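Pareto frontier analysis as described above reduces to a dominance check: a model stays on the frontier when no other model is at least as cheap and at least as good while being strictly better on one of the two. The sketch below is an illustration only. The `Candidate` type, the model names, and every cost and quality number are made up for the example and are not figures from the talk.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    cost: float     # dollars per 1,000 requests (illustrative)
    quality: float  # evaluation score in [0, 1] (illustrative)


def pareto_frontier(candidates):
    """Keep candidates that no alternative beats on cost and quality together."""
    frontier = []
    for c in candidates:
        dominated = any(
            o.cost <= c.cost and o.quality >= c.quality
            and (o.cost < c.cost or o.quality > c.quality)
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.cost)


# Made-up numbers for illustration only.
models = [
    Candidate("small", 1.0, 0.78),
    Candidate("medium", 4.0, 0.86),
    Candidate("medium-b", 5.0, 0.84),  # dominated: "medium" is cheaper and better
    Candidate("large", 20.0, 0.90),
]
best_value = [c.name for c in pareto_frontier(models)]
```

Everything left on the frontier is a defensible choice. Picking among those points is then a budget decision, such as whether the quality step from "medium" to "large" justifies five times the cost.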
When Prompts Fail: Enforcing Business Rules in AI Agents
You wrote a tool with a clear docstring: "Maximum 10 guests per booking." Your agent calls it with 15 guests and gets back "SUCCESS." The rule was ignored because prompts and docstrings are suggestions, not constraints. It is the problem web developers solved decades ago: never trust user input, validate on the server. For agents, validate at the tool layer. I'll build a guardrail system live with two parts. Rules defined as Python dataclasses, typed and testable, each naming the tool, parameter, and threshold. And a hook that intercepts every tool call before execution. The demo runs three invalid requests through two agents: the prompt-only one allows all three, the hook-based one blocks all three. Then the upgrade: instead of hard-failing, the hook returns a steer message that guides the agent to fix its own call (15 guests becomes 10) and stay helpful while inside the rules. You'll walk away with: • A hook-based validation pattern that works with any agent framework • Rules as dataclasses you can test and version independently • How to steer the agent to self-correct instead of dead-ending • Open-source code adaptable to payments, compliance, or any domain
Outline: • The Prompt Engineering Failure • Neurosymbolic Architecture • Live Implementation: Blocking • From Blocking to Steering • Production Patterns and Q&A
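The two parts described above, rules as dataclasses and a hook that runs before each tool call, can be sketched as follows. This is a minimal, framework-free illustration: the `Rule` fields follow the abstract (tool, parameter, threshold), but `before_tool_call`, its return shape, the `book_room` tool name, and the message wording are assumptions rather than the API of any particular agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    tool: str        # tool the rule applies to
    parameter: str   # parameter to check
    maximum: int     # threshold the value must not exceed


RULES = [Rule(tool="book_room", parameter="guests", maximum=10)]


def before_tool_call(tool, params, rules=RULES, steer=False):
    """Validate a call before it executes.

    Returns (allowed, message). With steer=True the message tells the agent
    how to correct the call instead of only rejecting it.
    """
    for rule in rules:
        if rule.tool != tool or rule.parameter not in params:
            continue
        value = params[rule.parameter]
        if value > rule.maximum:
            if steer:
                return False, (
                    f"STEER: {rule.parameter}={value} exceeds the maximum of "
                    f"{rule.maximum}. Retry with {rule.parameter}={rule.maximum} "
                    "and tell the user about the limit."
                )
            return False, f"BLOCKED: {rule.parameter}={value} exceeds {rule.maximum}."
    return True, "OK"
```

Because the rules are plain data, they can be unit-tested and versioned without running an agent, and the same check serves both the blocking and the steering variants.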
Catching Hallucinations with Multi-Agent Validation
Your agent confirms an operation with full confidence: reference number, details, status. One problem: the data is fabricated. The agent hallucinated the entire result, and your user won't find out until real-world consequences hit. Single agents have no way to verify their own output. When an LLM generates a plausible response, nothing distinguishes real data from invented data. The agent is equally confident either way. Multi-agent validation solves this with cross-validation between specialized roles. The Executor attempts the operation, the Validator independently checks that the data exists, and the Critic makes the final call. Swarm orchestration lets the agents hand off autonomously with shared context. A live demo runs the same booking through a single agent that fabricates a hotel, then through the pipeline that catches it before the user sees it. You'll walk away with: • The Executor-Validator-Critic pattern for your own agent systems • Swarm orchestration configured for autonomous agent handoffs • A cross-validation pipeline that catches hallucinations before users see them • A framework for deciding when the multi-agent overhead is worth it
Outline: • Single-Agent Hallucination • Multi-Agent Pattern • Live Implementation • Production Patterns • Advanced Applications
Context Engineering: Stop Agents from Choking on Their Own Data
Your agent just ingested 214KB of server logs. No errors, no warnings, looks like it worked. But the response is garbage. The context window silently overflowed, data got truncated, and your agent confidently answered from incomplete information. Tool outputs have no size limits by default, so one API call can return megabytes. Overflow throws no exception, it just degrades quality, and multi-agent systems multiply the problem as data passes between agents with no size controls. The Memory Pointer Pattern fixes this. Store large tool outputs in agent.state via ToolContext instead of returning them, then hand back a lightweight 52-byte pointer to the stored data. Use invocation_state for shared access across agents in Swarm systems. A live demo compresses 214KB to 52 bytes and runs a 3-agent Swarm processing 145KB+ logs in roughly 14 seconds. You'll walk away with: • A working Memory Pointer implementation using ToolContext and agent.state • Multi-agent state sharing patterns using invocation_state • Techniques for detecting silent context overflow in production • Open-source demo code processing 145KB+ logs across multiple agents
Outline: • The Silent Killer • Memory Pointer Pattern Deep Dive • Multi-Agent State Sharing • Production Patterns • Advanced Techniques and Wrap-Up
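The core of the Memory Pointer Pattern is that a tool stores its large output and returns only a short reference, which later tools dereference. The sketch below shows that idea with a plain dictionary standing in for `agent.state`. It does not use the ToolContext API from the talk, and the function names, the pointer format, and the synthetic log data are assumptions made for illustration.

```python
import uuid

# Stand-in for agent.state (hypothetical; the talk uses ToolContext and agent.state).
state = {}


def fetch_logs(state, raw_logs):
    """Tool that stores a large payload in state and returns only a pointer."""
    pointer = f"logs:{uuid.uuid4().hex[:8]}"
    state[pointer] = raw_logs
    return {"pointer": pointer, "size_bytes": len(raw_logs.encode("utf-8"))}


def count_errors(state, pointer):
    """Tool that dereferences the pointer and returns a small summary."""
    logs = state[pointer]
    return sum(1 for line in logs.splitlines() if "ERROR" in line)


# Synthetic log data for illustration only.
raw = "\n".join(
    f"line {i} {'ERROR timeout' if i % 100 == 0 else 'INFO ok'}" for i in range(10_000)
)
handle = fetch_logs(state, raw)
errors = count_errors(state, handle["pointer"])
```

Only `handle` and `errors` would ever reach the model's context. The full log text stays in state, where any tool holding the pointer can still read all of it.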
When RAG Hallucinates Numbers: Graph-RAG for Precise Answers
Your RAG agent seems smart until you ask it to count. "How many items match X?" It fabricates "about 45-50" when the real answer is 133. Vector similarity cannot count, aggregate, or reason across relationships. The root cause is architectural. RAG retrieves text chunks by similarity, then asks the LLM to synthesize an answer. That works for lookups but fails on four query types: counting, aggregation, multi-hop reasoning, and out-of-domain detection. Graph-RAG fixes this by building a knowledge graph automatically (no manual schema design) and using the Text2Cypher pattern to turn natural language into precise queries the LLM cannot fabricate. You will see two agents answer identical queries side by side: RAG invents results, Graph-RAG answers correctly every time. You'll walk away able to: • Build Graph-RAG over your own documents • Decide when to use RAG versus Graph-RAG • Combine both in a hybrid retrieval system All code is open source.
Outline: • The RAG Hallucination Problem • Graph-RAG Architecture • Live Implementation • Production Patterns • Decision Framework
Searching 500 Hours of Video Without a 6-Tool Pipeline
You need to find one moment in 500 hours of video: a demo where someone mentions a pricing change while showing a dashboard. The traditional approach extracts frames, runs OCR, transcribes audio, builds text and image embeddings, stores them in separate indices, then queries across all six outputs hoping the timestamps align. That is not retrieval, that is suffering. The problem is architectural. Traditional video RAG treats video as separate modalities that must be decomposed before search. Frame extraction loses temporal context, audio separation loses visual grounding, and separate embedding spaces create alignment nightmares. The orchestration layer becomes the most complex and most fragile part of the system. Multimodal models understand video natively, preserving what is shown, said, and displayed on screen. In this talk I build a video analysis agent live against the six-tool pipeline. You'll walk away with: • A working architecture for production video search with multimodal models • An agent approach that replaces a 200-line orchestration script with one tool call • A decision framework for when traditional decomposition still wins All code is open source.
Outline: • The 500-Hour Problem • Why Decomposition Fails • The Multimodal Shift • Building the Video Agent • When to Use What and Resources
Your AI Agent Isn't Crashing. It's Bleeding Tokens
Your AI agent does not crash; it gets stuck. It produces wrong results when data overflows the context window. It waits forever when an MCP tool calls a slow API. It calls the same tool 14 times because the response said "more results may be available." None of these throw errors. They just waste tokens and time. Three silent failures cost real money. Context overflow: a tool returns 214KB of logs, the window fills, and the agent returns incomplete results with no error. MCP hangs: an external API takes 15 seconds and the agent gets a cryptic 424. Reasoning loops: ambiguous feedback drives 14 retries with zero progress. I will cover three fixes, each with a live demo and before and after metrics. The Memory Pointer Pattern stores large data outside context and returns a pointer (IBM Research, 7x reduction). Async handleId for MCP returns a job ID and polls for results (Octopus, 17.2s to 1.7s). DebounceHook with clear SUCCESS states blocks duplicates (14 to 2). You'll walk away with: • Three production-ready patterns you can implement the same day • Working code with real metrics for each fix • An open-source repository with all demos
Outline: • Three Silent Failures • Fix 1: Memory Pointer Pattern • Fix 2: Async HandleId for MCP • Fix 3: DebounceHook + Clear States • Decision Matrix + Resources
One Guardrail Won't Stop Your Agent Hallucinating
Each hallucination fix solves one problem and leaves four others open. AI agents hallucinate in five distinct ways: fabricating data when retrieval returns nothing, picking wrong tools when descriptions overlap, ignoring business rules, failing on soft constraints, and bypassing hard requirements. One guardrail covers one failure mode. This talk maps each failure to its own defense. Graph-based retrieval computes from structured data instead of guessing, eliminating fabrication. Semantic tool routing through protocol discovery replaces brittle keyword matching. Database-driven rules update in seconds without redeployment. STEER messages guide self-correction, so a request for 15 guests becomes an agent that adjusts to 10 and tells the user. Framework hooks block operations the LLM must never bypass. You'll walk away with: • A layered defense covering all five failure modes • The STEER pattern for self-correction • Database-driven rules that change behavior without redeployment • A decision framework for hard hooks versus soft steering Demonstrated across 8 adversarial scenarios with zero hallucinations.
Outline: • Your AI Agent Hallucinates in 5 Different Ways • Grounded Retrieval with Graph Queries • Semantic Tool Routing • Steering Rules + STEER Messages • Hard Hooks That Cannot Be Bypassed • Full Layered Defense Test • Resources + Q&A
Stop AI Agent Hallucinations: 5 Techniques + Production Patterns
AI agents that book 15 guests in a 10-person room. Agents that fabricate statistics when the data doesn't exist. Agents that pick the wrong tool from 29 options and burn tokens. These aren't prompt engineering failures. They're architectural limitations that need structural solutions. This hands-on workshop covers 5 research-backed techniques. Graph-RAG replaces vector similarity guessing with precise entity relationships, cutting fabricated statistics by 73%. Semantic tool selection filters 29 tools to the relevant 5, for an 89% token reduction. Multi-agent Executor-Validator-Critic swarms catch 92% of fabrications. Neurosymbolic guardrails enforce rules through lifecycle hooks agents cannot bypass. Agent steering guides agents to self-correct instead of hard-failing. Each demo includes live code, before and after metrics, and a final module on production deployment. You'll walk away with working Python implementations, a decision framework for each technique, and an open-source repository adaptable to your domain. No AWS account or cloud costs required. Sandboxes are provided.
Outline: • Introduction - Why AI Agents Hallucinate Differently Than LLMs • Demo 00 - Strands Agents Primer • Demo 01 - Graph-RAG vs. Standard RAG • Demo 02 - Semantic Tool Selection • Demo 03 - Multi-Agent Validation • Demo 04 - Neurosymbolic Guardrails • Demo 05 - Agent Control Steering • Demo 06 - Production on Amazon Bedrock AgentCore • Workshop Recap + Resources • Q&A
Binary Tests Miss 73% of Your Agent's Quality
Your agent passes every test, then makes 3x more API calls than it needs in production. Binary pass/fail metrics only check the final answer, never how the agent got there. Research shows they miss 73% of quality gradations (Grading Scale, Jan 2026). Two fixes close the gap. LLM-as-Judge gives continuous scores from 0.0 to 1.0 with explanations. You'll see why vague prompts ("is this good?") cause position and verbosity bias, and how explicit rubric criteria keep scores consistent at scale. Trajectory evaluation scores the path, not just the answer, catching duplicate tool calls, irrelevant actions, and unsafe steps. AgentDrift (March 2026) found trajectory evaluation detects 91.3% of issues versus 26.4% for output-only. You'll walk away with: • Continuous quality scoring with explicit rubrics • Automatic trajectory capture via lifecycle hooks • A combined pattern wired into cloud observability Production-ready code, grounded in 2026 research.
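One piece of trajectory evaluation, catching duplicate tool calls, can be illustrated with a short sketch. The trajectory format (a list of dicts with `tool` and `args` keys) and the names `find_duplicate_calls` and `efficiency_score` are assumptions made for this example; the scoring formula is illustrative and is not the metric from the cited research.

```python
import json


def find_duplicate_calls(trajectory: list) -> list:
    """Return steps that repeat an earlier call with the same tool and arguments."""
    seen = set()
    duplicates = []
    for step in trajectory:
        # Serialize args with sorted keys so equal dicts compare equal.
        key = (step["tool"], json.dumps(step["args"], sort_keys=True))
        if key in seen:
            duplicates.append(step)
        else:
            seen.add(key)
    return duplicates


def efficiency_score(trajectory: list) -> float:
    """Continuous score in [0.0, 1.0]: the share of steps that were not redundant."""
    if not trajectory:
        return 1.0
    return 1.0 - len(find_duplicate_calls(trajectory)) / len(trajectory)


# Hypothetical captured trajectory: the second search repeats the first.
TRAJECTORY = [
    {"tool": "search", "args": {"city": "Lima", "nights": 2}},
    {"tool": "search", "args": {"nights": 2, "city": "Lima"}},
    {"tool": "get_price", "args": {"hotel_id": 7}},
    {"tool": "book", "args": {"hotel_id": 7}},
]
```

A binary test would mark this run as passed because the booking succeeded; the trajectory score of 0.75 shows that one of the four calls was wasted.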
Your Agent Lies and Passes Every Test
Your agent returns confident answers. Half are fabricated, and your test suite says PASS. Research shows standard metrics miss 65 to 93% of safety violations (AgentDrift, March 2026): agents invent amenities that never appeared in the search results and drift from safe to harmful advice across turns. Binary pass/fail sees "task completed" and misses the lie. Zero-shot hallucination detection finds fabricated facts with no training data. Linear Semantic Consistency (Oct 2025) hits 84.6% AUROC by probing the model's internal states, training-free across model families. Claim decomposition verifies atomic statements at 88.4% precision. You'll learn when to use each versus a real-time LLM judge. Trajectory monitoring catches behavioral drift, where an agent slides from legal strategy to gray-area optimization to tax evasion across turns. You'll add per-turn safety scoring that flags drops over 0.3, plus real-time guardrails using lifecycle hooks that swap unsafe output for a safe fallback at 120 ms. You'll walk away with: • Zero-shot detection that needs no labeled training data • Per-turn monitoring that catches drift before harm • Real-time guardrails that block unsafe output before delivery
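The per-turn safety scoring mentioned in this abstract can be sketched as a simple drift check. The abstract does not say what a "drop over 0.3" is measured against, so this example assumes it means a fall of more than 0.3 from the highest safety score seen so far in the conversation, which catches both sudden drops and gradual slides. The names `flag_drift`, `guard_output`, and `SAFE_FALLBACK` are hypothetical, and the scores themselves would come from a separate judge or classifier.

```python
# Hypothetical fallback text swapped in when a turn is flagged.
SAFE_FALLBACK = "I can't help with that, but I can explain the legal options."


def flag_drift(scores: list, threshold: float = 0.3) -> list:
    """Return indices of turns whose score fell more than `threshold` below the peak so far."""
    flagged = []
    peak = None
    for turn, score in enumerate(scores):
        if peak is None or score > peak:
            peak = score
        elif peak - score > threshold:
            flagged.append(turn)
    return flagged


def guard_output(outputs: list, scores: list, threshold: float = 0.3) -> list:
    """Replace the output of every flagged turn with the safe fallback."""
    flagged = set(flag_drift(scores, threshold))
    return [SAFE_FALLBACK if i in flagged else text
            for i, text in enumerate(outputs)]
```

With scores of 0.9, 0.7, and 0.5 across three turns, no single step drops by more than 0.3, yet the third turn is flagged because it sits 0.4 below the peak. That gradual slide is the pattern a check against only the previous turn would miss.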
Prairie Dev Con 2026 Sessionize Event Upcoming
PyCon US 2026
How to Build Your First Real-Time Voice Agent in Python (Without Losing Your Mind)
AgentCon - Silicon Valley Sessionize Event
Orlando Code Camp 2026 Sessionize Event
DeveloperWeek 2026
Master Vibe Coding and Deploy AI Agents to Production
PyLadies San Francisco @ LinkedIn
Have a Conversation with Your Videos: Video Analysis Agents in Python
Python Meetup - Extending AI agents: Custom tools and Model Context Protocol
Extending AI agents: Custom tools and Model Context Protocol
Tech Talk: Moving Agents to Production with Strands and AgentCore
DevFest Fresno - Build with AI Sessionize Event
DataWeek 2025 Sessionize Event
MCP Dev Day 2025
Tech Talk: Extending AI agents: Custom tools and Model Context Protocol
AICamp Women in AI 2025
Agentic AI: Designing with Intelligence & Autonomy.
Description: An introduction to building AI agents with Strands Agents, aimed at early-career developers.
Meetup - AWS User Group Ajolotes Ciudad de Mexico
Multimodal Agents with Python: Processing Images, Videos, and Documents in a Few Lines of Code
PyCon US 2025
Building a Multimodal Search Engine: Combining Text and Images for Intelligent Search.
In today's data-driven world, efficiently processing and analyzing large volumes of data is crucial for many applications. Let's explore together how to create and manage text and image embeddings for similarity search in a PostgreSQL database. We'll dive into a practical example using Python to demonstrate how you can build search engines that use natural language.
AWSome Women Summit Latam 2025 Sessionize Event
AWS Community Day Chile 2024 Sessionize Event
AWS Community Day Argentina 2024 Sessionize Event
KCD Argentina 2024 Sessionize Event
AWS Community Day 2024 Sessionize Event
Nerdearla Chile 2024 Sessionize Event
AWS Women Summit 2024 Argentina Sessionize Event
AWS Community Day Uruguay 2023 Sessionize Event
CodeCampSDQ 2023 Sessionize Event
CDK Day 2023 Sessionize Event
AWS UG Perú Conf 2023 Sessionize Event
PyDay Chile 2023 Sessionize Event