Multimodal AI Agents with Long-Term Memory

A user asks your multimodal agent to "compare this video with the one I shared last week," and the agent has no idea which video they mean. You built something intelligent with no memory, and without memory, intelligence is just computation.

The memory problem is harder than it looks. Text-only agents can summarize conversations into strings, but multimodal agents process video frames, image features, audio, and text at once. What should be remembered: the raw video, a description, or the embeddings? And how do you retrieve the right memory when a new conversation references something from three sessions ago? Traditional session stores were not designed for this.

This talk builds multimodal agents with an open-source agent SDK (similar patterns apply to LangGraph or AutoGen), creates custom video analysis tools, converts them into reusable MCP servers, and adds scalable chat memory backed by a managed vector store that retrieves memories by semantic search. You'll leave with a multi-agent architecture and patterns for MCP servers that you build once and share across your fleet of agents.
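To make the retrieval question concrete, here is a minimal sketch of one possible answer: store a text description and its vector alongside a pointer to the raw media, then rank memories by similarity to the new request. Everything in it is an assumption for illustration and not the talk's implementation: the `Memory` and `MemoryStore` names, the example URIs, and the toy bag-of-words `embed` function, which stands in for a real embedding model and a managed vector store.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Memory:
    session_id: str
    media_uri: str    # pointer to the raw video or image, not the bytes (hypothetical URI)
    description: str  # text summary, e.g. produced by a video analysis tool
    vector: Counter = field(init=False)

    def __post_init__(self) -> None:
        # Index the description, so retrieval works on meaning-bearing text.
        self.vector = embed(self.description)


class MemoryStore:
    """In-memory stand-in for a vector store shared across sessions."""

    def __init__(self) -> None:
        self._items: list[Memory] = []

    def remember(self, session_id: str, media_uri: str, description: str) -> None:
        self._items.append(Memory(session_id, media_uri, description))

    def recall(self, query: str, top_k: int = 1) -> list[Memory]:
        # Rank every stored memory by similarity to the query text.
        q = embed(query)
        ranked = sorted(self._items, key=lambda m: cosine(q, m.vector), reverse=True)
        return ranked[:top_k]


store = MemoryStore()
store.remember("session-1", "s3://example-bucket/cooking.mp4",
               "video of a chef cooking pasta in a kitchen")
store.remember("session-2", "s3://example-bucket/match.mp4",
               "video of a soccer match with two goals")

hit = store.recall("the soccer video I shared last week")[0]
print(hit.session_id, hit.media_uri)
```

The design choice sketched here is to keep the raw media out of the memory itself: the store holds a searchable description plus a reference, so a later session can resolve "the one I shared last week" to a concrete file. A real system would swap the count vectors for model embeddings and would likely also filter on metadata such as user or timestamp.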

Outline:
• The Agent That Forgets
• Building Multimodal Agents with Strands
• Converting Tools to MCP Servers
• Scalable Chat Memory with S3 Vectors
• The Complete System and Resources

Elizabeth Fuentes Leone

Developer Advocate

San Francisco, California, United States
