Multimodal RAG: Video Search Without the Pipeline

You need to search 500 hours of video for a specific product demo where someone mentions a pricing change while showing a dashboard. With the traditional approach, you would extract frames at fixed intervals, run OCR on each frame, separate the audio track, transcribe it, generate text embeddings for the transcripts, generate image embeddings for the frames, store everything in separate vector indices, and then orchestrate a query across all of those outputs, hoping the timestamps align. That is not retrieval; that is suffering.

The problem is architectural. Traditional video RAG treats video as a bundle of separate modalities that must be decomposed before they can be searched. Frame extraction loses temporal context. Audio separation loses visual grounding. Separate embedding spaces create alignment nightmares. The orchestration layer becomes the most complex part of your system, and it is also the most fragile. When it breaks (and it will break), you debug across six different tools trying to figure out where the pipeline lost the answer.

In this talk, I will show you:
• How traditional video RAG pipelines decompose video into frames, audio, and text, and why each decomposition step loses information
• How multimodal models understand video natively, preserving temporal relationships between what is shown, said, and displayed on screen
• How unified temporal embeddings eliminate the alignment problem that plagues multi-index approaches
• How agent-based architectures turn a 200-line orchestration script into a single tool call
• A live demo: building a complete video analysis agent that searches, summarizes, and answers questions about video content in minutes

You will walk away with:
• A working architecture for production video RAG using multimodal models and pgvector
• A decision framework for when traditional decomposition still makes sense (hint: edge cases exist)
• Performance benchmarks comparing six-tool pipelines versus single-agent approaches on retrieval accuracy
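To make the unified-embedding idea concrete, here is a minimal sketch of searching video segments that each carry one timestamped multimodal embedding, rather than separate frame/transcript/OCR indices. The `embed` function below is a hypothetical stand-in for a multimodal model (in the talk's architecture the vectors would live in pgvector); everything here is illustrative, not the talk's actual code.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in for a multimodal embedding model.

    In the real architecture, one model embeds a whole video segment
    (frames + audio + on-screen text) into a single vector, so visual
    and spoken content share one embedding space. Here we derive a
    deterministic unit vector from the text so the sketch is runnable.
    """
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

# One row per video segment: a time range plus ONE unified embedding,
# instead of separate indices that must be aligned by timestamp later.
segments = [
    {"t_start": 0.0,  "t_end": 30.0, "desc": "intro slide and agenda"},
    {"t_start": 30.0, "t_end": 60.0, "desc": "dashboard demo, pricing change announced"},
    {"t_start": 60.0, "t_end": 90.0, "desc": "audience Q&A"},
]
for seg in segments:
    seg["emb"] = embed(seg["desc"])

def search(query: str, k: int = 1) -> list[dict]:
    """Rank segments by cosine similarity (vectors are unit-normalized)."""
    q = embed(query)
    return sorted(segments, key=lambda s: -float(s["emb"] @ q))[:k]

hit = search("dashboard demo, pricing change announced")[0]
print(f"{hit['t_start']:.0f}s-{hit['t_end']:.0f}s: {hit['desc']}")
```

The point of the sketch is the data model: because each segment has a single embedding with its time range attached, the "which frame goes with which transcript line" alignment problem never arises.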

Outline:
• The 500-Hour Problem
• Why Decomposition Fails
• The Multimodal Shift
• Building the Video Agent
• When to Use What and Resources
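The "200-line orchestration script becomes a single tool call" claim can be sketched as follows. The tool name `search_video`, its schema, and the in-memory index are all hypothetical placeholders: in the talk's architecture the implementation would query the unified pgvector index, and the schema would be registered with whatever agent framework is in use.

```python
import json

# Hypothetical tool schema an agent framework might register. The agent
# makes one call to this tool instead of orchestrating frame extraction,
# OCR, transcription, and multi-index retrieval itself.
SEARCH_VIDEO_TOOL = {
    "name": "search_video",
    "description": "Search indexed video with a natural-language query; "
                   "returns matching segments with timestamps.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer", "default": 3},
        },
        "required": ["query"],
    },
}

# Stand-in backing store; a real implementation would hit the vector index.
INDEX = [
    {"t": "00:31", "text": "pricing change shown on dashboard"},
    {"t": "01:05", "text": "closing remarks"},
]

def search_video(query: str, top_k: int = 3) -> str:
    """Toy keyword match standing in for unified-embedding retrieval."""
    words = query.lower().split()
    hits = [seg for seg in INDEX if any(w in seg["text"] for w in words)]
    return json.dumps(hits[:top_k])

print(search_video("pricing dashboard"))
```

The design point is that all pipeline complexity lives behind one function boundary, so the agent's job reduces to choosing when to call it and how to use the result.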

Elizabeth Fuentes Leone

Developer Advocate

San Francisco, California, United States
