
The Evolution of Video RAG

Video RAG used to mean frame extraction, audio separation, multiple embeddings, and complex orchestration pipelines.

A year ago, if you wanted to build semantic search over video content, you'd extract frames, transcribe audio separately, generate embeddings for each modality, calculate similarities, and orchestrate everything yourself. Three hundred lines of Python. Six different libraries. Multiple failure points.

Then multimodal models that natively understand video arrived—and everything changed.

I'll show you both approaches through working code:

The Traditional Pipeline:
- Frame extraction and key frame selection using cosine similarity
- Audio transcription with speaker diarization and timestamps
- Separate embedding generation for visual and audio content
- Vector storage and semantic search with `pgvector`
- Custom orchestration logic to tie it all together
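To make the first bullet concrete, here is a minimal sketch of key-frame selection by cosine similarity. The tiny 2-D "frame features" are synthetic placeholders; in a real pipeline they would be color histograms or embeddings extracted from decoded frames (e.g. via OpenCV), and the threshold would be tuned per use case.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_key_frames(features: list[np.ndarray], threshold: float = 0.9) -> list[int]:
    """Keep a frame only when it differs enough from the last kept frame
    (cosine similarity below the threshold) -- classic key-frame selection."""
    if not features:
        return []
    kept = [0]
    for i in range(1, len(features)):
        if cosine_similarity(features[kept[-1]], features[i]) < threshold:
            kept.append(i)
    return kept

# Synthetic "frames": two near-duplicates, then a scene change.
frames = [
    np.array([1.0, 0.0]),    # frame 0
    np.array([0.99, 0.05]),  # nearly identical to frame 0 -> skipped
    np.array([0.0, 1.0]),    # orthogonal to frame 0 -> new key frame
]
print(select_key_frames(frames))  # → [0, 2]
```

This is the easy part; the traditional pipeline then has to transcribe audio, embed each modality, and stitch the results back together by timestamp.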

The Modern Approach:
- Native video understanding without decomposition
- Unified temporal embeddings across modalities
- Agent-based architectures that handle orchestration
- Production-ready patterns in a fraction of the code
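For contrast, a sketch of the shape the modern approach takes. `VideoAnalysisAgent` and its `ask` method are hypothetical names, and the body is a stub standing in for a single request to a video-capable multimodal model; the point is architectural: one call replaces the whole decomposition pipeline.

```python
from dataclasses import dataclass

@dataclass
class VideoAnalysisAgent:
    """Hypothetical agent wrapping a video-capable multimodal model.

    In production, ask() would send the video and the question in a single
    request to the model -- no frame extraction, no separate transcription,
    no per-modality embeddings, no orchestration code.
    """
    model_id: str

    def ask(self, video_uri: str, question: str) -> str:
        # Stub response; a real client returns the model's answer,
        # including temporal references into the video.
        return f"[{self.model_id}] answer about {video_uri}: {question!r}"

agent = VideoAnalysisAgent(model_id="video-capable-model")
print(agent.ask("s3://demo/talk.mp4", "When does the speaker discuss pgvector?"))
```

The entire "pipeline" is now the model call plus whatever agent logic routes questions to it, which is why the modern version fits in a fraction of the code.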

Join me for a live demonstration of native video understanding with no embedding pipelines to build, and watch a complete video analysis agent come together in minutes.

Elizabeth Fuentes Leone

Developer Advocate

San Francisco, California, United States


