Searching 500 Hours of Video Without a 6-Tool Pipeline

You need to find one moment in 500 hours of video: a demo where someone mentions a pricing change while showing a dashboard. The traditional approach extracts frames, runs OCR, transcribes audio, builds text and image embeddings, stores them in separate indices, then queries across all six outputs hoping the timestamps align. That is not retrieval; that is suffering.

The problem is architectural. Traditional video RAG treats video as separate modalities that must be decomposed before search. Frame extraction loses temporal context, audio separation loses visual grounding, and separate embedding spaces create alignment nightmares. The orchestration layer becomes the most complex and most fragile part of the system.

Multimodal models understand video natively, preserving what is shown, said, and displayed on screen. In this talk I build a video analysis agent live and compare it against the six-tool pipeline.

You'll walk away with:
• A working architecture for production video search with multimodal models
• An agent approach that replaces a 200-line orchestration script with one tool call
• A decision framework for when traditional decomposition still wins

All code is open source.
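To make the alignment problem concrete, here is a minimal sketch of the join step a decomposed pipeline ends up needing: pairing hits from a transcript index with hits from an OCR index by timestamp overlap. It is not the talk's code. The `Hit` type, the `align_hits` function, the video IDs, the scores, and the 2-second tolerance are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search result from a single-modality index (hypothetical shape)."""
    video_id: str
    start: float   # seconds from the start of the video
    end: float
    modality: str  # e.g. "transcript" or "ocr"
    score: float


def align_hits(hits_a, hits_b, tolerance=2.0):
    """Pair hits from two indices that share a video and overlap in time.

    Two intervals count as overlapping when each starts before the other
    ends, with both ends padded by `tolerance` seconds.
    """
    pairs = []
    for a in hits_a:
        for b in hits_b:
            if a.video_id != b.video_id:
                continue
            if a.start <= b.end + tolerance and b.start <= a.end + tolerance:
                # Combine scores by simple product (an arbitrary choice).
                pairs.append((a, b, a.score * b.score))
    # Best combined score first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Made-up results: "pricing change" in the transcript, "dashboard" via OCR.
transcript_hits = [
    Hit("demo_017", 120.0, 128.0, "transcript", 0.91),
    Hit("demo_042", 300.0, 306.0, "transcript", 0.88),
]
ocr_hits = [
    Hit("demo_017", 126.5, 127.0, "ocr", 0.80),
    Hit("demo_042", 340.0, 341.0, "ocr", 0.95),
]

matches = align_hits(transcript_hits, ocr_hits)
for a, b, combined in matches:
    print(f"{a.video_id}: {a.start:.1f}s to {a.end:.1f}s, combined score {combined:.3f}")
```

In this made-up data, `demo_042` has strong hits in both indices, but they sit more than 30 seconds apart, so the join drops it. Widening the tolerance recovers it at the cost of false pairings, which is the kind of tuning that makes the orchestration layer fragile.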

Outline:
• The 500-Hour Problem
• Why Decomposition Fails
• The Multimodal Shift
• Building the Video Agent
• When to Use What and Resources

Elizabeth Fuentes Leone

Developer Advocate

San Francisco, California, United States

