Session

Evolution of Gemma: Architecture, Capabilities, and Ecosystem of Google's Open Model Family

In April 2026, Google released Gemma 4. In just over two years, the Gemma family has gone from a single dense decoder-only transformer to a multimodal lineup of dense and MoE models that spans everything from a 2B model running on a phone to a 31B dense model scoring 1452 on LMArena. The story between Gemma 1 and Gemma 4 isn't just bigger numbers; it's a series of deliberate architectural pivots, each one solving a specific bottleneck the previous generation hit. This talk walks through the four generations side by side: what changed in the architecture, why it changed, what new capabilities came along for the ride, and how the ecosystem of Gemma-derived models grew around the core releases.
We'll start with the attention story, because that's where the most visible engineering happens. Gemma 1 used pure global attention: clean and simple, but memory-hungry at long context, which is why the 8K window was a hard ceiling. Gemma 2 introduced GQA (Grouped-Query Attention) and a hybrid pattern alternating local and global attention 1:1, added logit soft-capping to stabilize training, and leaned on knowledge distillation from larger Gemini teachers. Gemma 3 pushed the local-to-global ratio to 5:1 with a 1024-token sliding window and swapped soft-capping for QK-norm for both quality and speed; the narrower, more frequent local layers are what cut KV-cache memory by more than 45% compared to Gemma 2, and that is the change that made 128K context realistic on consumer hardware. Gemma 4 keeps the alternating sliding/global structure but adds Per-Layer Embeddings (PLE), a lower-dimensional conditioning pathway parallel to the residual stream that produces a small per-layer vector for each token and modulates hidden states layer by layer. PLE is what lets the small E2B and E4B models punch above their weight class on reasoning benchmarks while staying on-device-friendly.
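To ground that memory argument, here's a rough back-of-the-envelope sketch of how the local-to-global ratio and sliding-window size drive KV-cache size at long context; the layer count, KV-head count, head dimension, and window sizes below are illustrative assumptions for a 27B-class model, not official Gemma configurations.

```python
# Back-of-the-envelope KV-cache estimate for a hybrid local/global attention stack.
# All configuration numbers are illustrative assumptions, not official Gemma specs.

def kv_cache_bytes(n_layers, local_ratio, window, context,
                   n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV-cache size for a decoder alternating local and global layers.

    local_ratio = number of local (sliding-window) layers per global layer.
    Local layers only cache the last `window` tokens; global layers cache everything.
    """
    group = local_ratio + 1                  # one repeating block: N local + 1 global
    n_global = n_layers // group
    n_local = n_layers - n_global
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem   # K and V, per layer, per token
    return (n_local * min(window, context) + n_global * context) * per_token

ctx = 128_000  # 128K context
gemma2_like = kv_cache_bytes(46, local_ratio=1, window=4096, context=ctx,
                             n_kv_heads=16, head_dim=128)
gemma3_like = kv_cache_bytes(46, local_ratio=5, window=1024, context=ctx,
                             n_kv_heads=16, head_dim=128)
print(f"1:1 pattern, 4K window : {gemma2_like / 2**30:.1f} GiB")
print(f"5:1 pattern, 1K window : {gemma3_like / 2**30:.1f} GiB")
```

With these assumed numbers the 5:1 pattern needs roughly a third of the cache the 1:1 pattern does, which points in the same direction as the 45%+ reduction described above.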
The capability evolution maps onto the architecture story. Gemma 1 was text-only at 8K context. Gemma 2 stayed text-only but got serious about quality through distillation. Gemma 3 added vision with variable aspect ratios and resolutions, pushed context to 128K across all sizes, and brought support for 140+ languages. Gemma 4 goes fully multimodal: text, image, video, and native audio (audio lands on E2B and E4B specifically, aimed at on-device voice agents), with 128K context on the small models and 256K on the medium ones, native function calling, and a native system role for structured conversations. The 31B dense model and the 26B MoE (4B parameters active per token, though all 26B stay resident in memory because any expert can be selected) sit at the cloud end of the spectrum.
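As a quick illustration of why the dense/MoE split matters for deployment planning, here's a minimal sketch comparing resident memory with per-token compute using the parameter counts above; the bf16 bytes-per-weight figure is an assumption, and the sketch ignores KV cache, activations, and quantization.

```python
# Rough footprint-vs-compute comparison of a dense model and a sparse MoE.
# Parameter counts come from the talk's framing; bytes-per-weight (bf16) is an assumption.

BYTES_PER_PARAM = 2          # assume bf16 weights, no quantization

dense_params = 31e9          # 31B dense: every parameter works on every token
moe_total = 26e9             # 26B MoE: all experts must stay resident in memory
moe_active = 4e9             # ...but only ~4B parameters fire per token

print(f"31B dense : {dense_params * BYTES_PER_PARAM / 1e9:.0f} GB resident, "
      f"{dense_params / 1e9:.0f}B params of compute per token")
print(f"26B MoE   : {moe_total * BYTES_PER_PARAM / 1e9:.0f} GB resident, "
      f"{moe_active / 1e9:.0f}B params of compute per token")
```

The footprint is set by total parameters while per-token latency tracks active parameters, which is exactly the trade-off behind the "smarter cloud choice" question in the deployment section below.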
Then we'll walk through the ecosystem of Gemma-derived models, because the family is much bigger than the core releases: CodeGemma for code completion across Python, Java, C++, and more; PaliGemma 2 as the vision-language sibling built on Gemma 2; MedGemma for medical text and images, with a 4B multimodal and a 27B text-only variant built on Gemma 3; RecurrentGemma, which swaps transformer attention for an RNN backbone with local attention to save memory on long sequences; ShieldGemma 2, fine-tuned from Gemma 3 4B for safety classification on text and images; and T5Gemma, an encoder-decoder collection trading some quality for inference efficiency. We'll cover when each variant is the right tool, what the licensing actually allows (the custom Gemma Terms of Use rather than plain Apache 2.0), and where the ecosystem is heading after Gemma 4.
Finally, we'll get into deployment, because choosing a Gemma model is one decision and actually running it is another. How MediaPipe LLM Inference and LiteRT serve Gemma on Android and iOS, how flutter_gemma brings the same models to Flutter on desktop and web, what the real memory and latency budgets look like for Gemma 3n vs Gemma 4 E2B/E4B on a mid-range device, when the 26B MoE becomes a smarter cloud choice than a 27B dense model, and how Vertex AI Model Garden, Hugging Face, Ollama, and on-device runtimes compare for production deployment.
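As a taste of the cloud/desktop end of that comparison, here's a minimal sketch of calling a locally served Gemma model through the Ollama Python client. The model tag is an assumption, so check the Ollama library for the Gemma tags actually available to you; the on-device paths (MediaPipe LLM Inference, LiteRT, flutter_gemma) have their own APIs that the talk walks through separately.

```python
# Minimal sketch: chat with a locally served Gemma model via the Ollama Python client.
# The model tag below is an assumption; run `ollama list` to see what you have pulled.
import ollama

response = ollama.chat(
    model="gemma3:4b",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain local vs global attention in two sentences."},
    ],
)
print(response["message"]["content"])
```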
You'll leave with a clear architectural map of the Gemma family — what changed and why between each generation, which model fits which workload, and how to navigate the ecosystem of variants and runtimes when you're picking the right Gemma for your app.
Best fit for engineers working with open models, mobile and Flutter developers shipping on-device AI, ML practitioners interested in the architectural details behind modern open transformers, and architects evaluating Gemma against other open model families.

Sasha Denisov

Brainform.ai CTO; Cloud.AI, Flutter, Dart and Firebase GDE

Berlin, Germany
