Session
Building Physical AI: Fine-Tuning VLA Models and Composing Multi-Modal Systems
Physical AI requires a different stack than chatbots. You fine-tune vision-language-action (VLA) models for specific skills, then compose them with perception and voice systems that run in real time. Integration is where these systems come together, and where most projects fail.
This talk shares the architecture behind an award-winning book-reading robot: a system that opens books, turns pages, sees content, and reads aloud with expressive voices. We fine-tuned VLA models using Action Chunking with Transformers (ACT) to learn fluid manipulation from human demonstrations, then integrated Claude Vision for page understanding and ElevenLabs for streaming speech synthesis.
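To make the action-chunking idea concrete, here is a minimal sketch of temporal ensembling as described in the ACT paper: the policy predicts a chunk of future actions at every step, and the action executed at time t is a weighted average of every overlapping chunk's prediction for t, with weights exp(-m * i) where i = 0 is the oldest prediction. Whether the robot in this talk uses temporal ensembling is an assumption; the function name, the `predictions` layout, and the value of `m` are hypothetical and chosen only for illustration.

```python
import math


def ensemble_action(predictions, t, m=0.1):
    """Blend all chunk predictions that cover timestep t.

    predictions: dict mapping a chunk's start timestep s to a list of
                 actions (each a list of floats) for s, s+1, ..., s+k-1.
    t:           timestep whose action we want to execute.
    m:           decay rate; larger m trusts older predictions more.
    """
    candidates = []
    # Visit chunks from oldest to newest so index 0 is the oldest prediction.
    for start in sorted(predictions):
        chunk = predictions[start]
        offset = t - start
        if 0 <= offset < len(chunk):
            candidates.append(chunk[offset])
    if not candidates:
        raise ValueError(f"no chunk covers timestep {t}")

    # Exponential weights: w_i = exp(-m * i), with i = 0 the oldest prediction.
    weights = [math.exp(-m * i) for i in range(len(candidates))]
    total = sum(weights)
    dim = len(candidates[0])
    return [
        sum(w * action[d] for w, action in zip(weights, candidates)) / total
        for d in range(dim)
    ]
```

Because consecutive chunks overlap, the executed trajectory changes smoothly even when successive predictions disagree slightly, which is one plausible reason chunked policies look more fluid than a sequence of hand-stitched skill primitives.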
I'll walk through the technical stack: how we collected demonstration data and fine-tuned for page-turning skills, why ACT policies outperform discrete skill primitives for manipulation, and how we achieved zero-latency speech through a three-threaded streaming pipeline. Physical AI means designing for graceful failure—because in the real world, errors tear pages.
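As a rough illustration of a three-threaded streaming pipeline, the sketch below connects three stages with bounded queues so that playback of the first sentence can begin while later sentences are still being synthesized. The split into text extraction, synthesis, and playback is an assumption, since the abstract does not say what each thread does; `synthesize` and `play` are hypothetical callables standing in for a speech API and an audio device, and no real vendor API is called.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream on each queue


def text_producer(pages, text_q):
    """Thread 1: split page text into sentences (stand-in for page understanding)."""
    for page in pages:
        for sentence in page.split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                text_q.put(sentence)
    text_q.put(SENTINEL)


def synthesizer(text_q, audio_q, synthesize):
    """Thread 2: turn each sentence into an audio chunk as soon as it arrives."""
    while True:
        sentence = text_q.get()
        if sentence is SENTINEL:
            audio_q.put(SENTINEL)
            return
        audio_q.put(synthesize(sentence))


def player(audio_q, play):
    """Thread 3: play audio chunks in order while later ones are still in flight."""
    while True:
        chunk = audio_q.get()
        if chunk is SENTINEL:
            return
        play(chunk)


def run_pipeline(pages, synthesize, play, max_buffered=4):
    # Bounded queues apply backpressure so no stage runs far ahead of playback.
    text_q = queue.Queue(maxsize=max_buffered)
    audio_q = queue.Queue(maxsize=max_buffered)
    threads = [
        threading.Thread(target=text_producer, args=(pages, text_q)),
        threading.Thread(target=synthesizer, args=(text_q, audio_q, synthesize)),
        threading.Thread(target=player, args=(audio_q, play)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

A quick way to exercise it without audio hardware is `run_pipeline(["Hello world. Goodbye."], lambda s: s.encode(), print)`. In this design the listener waits only for the first sentence to be synthesized rather than the whole page, which is the property a streaming pipeline is built to provide.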
You'll leave with concrete patterns for fine-tuning VLA models, composing multi-modal physical AI systems, and understanding where the hard problems actually live.
Learning Objectives
* Fine-tune vision-language-action models for specific manipulation skills
* Apply Action Chunking with Transformers for fluid robotic behavior
* Design multi-modal physical AI systems that integrate manipulation, vision, and voice
* Build zero-latency streaming pipelines for real-time human-robot interaction
* Identify failure modes unique to physical AI and design recovery strategies
Level
Intermediate
Tags
Physical AI, VLA Models, Robotics, Fine-Tuning, Vision-Language-Action, Real-time Systems
I will bring the robotic arm and can make it available for demonstrations if there is interest.
Alison Cossette
Data Science Strategist, Advocate, Educator
Burlington, Vermont, United States