Session
From CNNs to Large Multimodal Models: Choosing the Right Vision AI Architecture
Computer vision has evolved dramatically in recent years - from CNNs detecting objects, to vision transformers understanding scenes, to large multimodal models that can reason about images using language. But which architecture should you actually use?
This session guides you through the three major paradigms in modern computer vision: traditional CNNs, vision transformers, and large multimodal models (LMMs). You'll learn how each works under the hood, understand their strengths and limitations, and see real-world comparisons across different tasks.
Through practical examples, we'll explore when a specialized CNN outperforms a general-purpose LMM, when vision transformers excel, and when multimodal capabilities justify their cost and complexity. You'll leave with a decision framework for choosing the right architecture for your specific vision challenges - and implementation patterns for each.
After 8 years of computer vision implementations - both research prototypes and production systems across healthcare, manufacturing, and sanitary inspections - I've learned that architecture choice matters more than model selection. This talk shares the decision framework we developed across 15+ projects, including real cases where a 2015-era CNN outperformed 2025 LMMs, and the architectural patterns that actually survive contact with production constraints.
Agata Chudzińska
CTO / AI Solutions Architect at theBlue.ai GmbH
Poznań, Poland