Session
Hybrid AI. One Pipeline for Cloud and On-Device with Genkit
Cloud-only AI is expensive and breaks offline. Pure on-device is private and cheap but can't handle every task. Hybrid AI combines them, but the real engineering question is how to do that without maintaining two separate stacks, two APIs, and two sets of quirks. Genkit, Google's open-source AI framework, now ships natively in Dart through the genkit package on pub.dev (the Genkit Dart preview launched in March 2026 and got front-stage time at Cloud Next '26). That means a Flutter app can talk to Gemini in the cloud and to Gemma running locally on the same device through one consistent contract: no JS server in the middle, no provider-specific glue code, just flows. This talk breaks down what that contract actually looks like under the hood, and which architectural decisions pay off when you ship hybrid into production.
We'll start with Genkit's plugin architecture: the model contract every plugin implements, how the framework abstracts provider-specific quirks (token formats, streaming protocols, function-calling dialects), and how a one-line model swap is possible without lying about the underlying differences. From there we'll cover the three abstraction levels Genkit gives you (raw generation, typed flows, and agents), when each one is the right tool, and why flows become the natural boundary for hybrid routing: they wrap a deterministic input/output schema around a non-deterministic model call. We'll also look at why having Genkit available natively in Dart matters for mobile: the entire pipeline (flow definitions, tool declarations, structured output schemas, retrieval) lives in the same codebase as the UI, with no language boundary to cross.
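To make the "flows as a typed boundary" idea concrete, here is a minimal sketch in plain Dart. This is not the genkit package API; every name here (Summary, ModelCall, summarizeFlow) is illustrative. The point is the shape: the caller sees a deterministic String-to-Summary contract, while the non-determinism of whichever model backs the call stays inside.

```dart
import 'dart:convert';

/// Illustrative output schema. In a real flow this would be a declared
/// structured-output schema; here it is just a typed Dart class.
class Summary {
  final String title;
  final List<String> bullets;
  Summary(this.title, this.bullets);

  /// Schema enforcement: reject model output that doesn't parse into
  /// the declared shape, instead of leaking raw text to callers.
  static Summary fromModelOutput(String raw) {
    final json = jsonDecode(raw) as Map<String, dynamic>;
    return Summary(
      json['title'] as String,
      (json['bullets'] as List).cast<String>(),
    );
  }
}

/// Any backend (cloud Gemini, local Gemma) reduced to one function type.
typedef ModelCall = Future<String> Function(String prompt);

/// The "flow": callers get String -> Summary regardless of which model
/// actually serves the request, which is what makes it a routing boundary.
Future<Summary> summarizeFlow(String text, ModelCall model) async {
  final raw = await model('Summarize as JSON {title, bullets}: $text');
  return Summary.fromModelOutput(raw);
}
```

Because the flow owns both the prompt and the output parsing, swapping the backend underneath it changes nothing for the caller, which is exactly the property hybrid routing relies on.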
The on-device side is where most hybrid stories quietly fall apart, so we'll spend real time there. flutter_gemma is the foundation: a Flutter plugin that wraps MediaPipe LLM Inference and LiteRT to run Gemma, Llama, and other open models on Android, iOS, macOS, Windows, Linux, and Web from a single Dart codebase, with multimodal vision/audio, on-device function calling, GPU acceleration, text embeddings, and on-device RAG. On top of that, genkit_flutter_gemma (featured by Google Genkit) plugs that runtime into the Genkit model contract, exposing models like gemma-3-nano and embedders like embedding-gemma-300m, so a local call looks identical to a Gemini call at the flow layer. We'll go through how that wrapping actually works, how GPU vs. CPU delegation behaves differently on Android and iOS, what the real memory budget looks like for Gemma 3 Nano on a mid-range device, and where on-device function calling diverges from the cloud: different tool-format dialects, smaller context windows, and no native vision in some configurations. We'll also look at how to write a structured output schema that holds up on both ends, because cloud models will give you clean JSON, and a small local model will not.
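The "clean JSON vs. not" gap usually comes down to parsing defensively. A sketch of one common technique, in plain Dart with no package APIs assumed: try strict JSON first (the cloud path), then fall back to extracting the first brace-delimited span, since small local models often wrap JSON in markdown fences or surrounding prose.

```dart
import 'dart:convert';

/// Tolerant JSON extraction for model output.
/// Returns null if no parseable JSON object is found.
Map<String, dynamic>? extractJson(String raw) {
  // Fast path: the whole response is already valid JSON
  // (typical for cloud models with structured output enabled).
  try {
    return jsonDecode(raw) as Map<String, dynamic>;
  } on FormatException {
    // Fall through to the lenient path.
  }
  // Lenient path: take the outermost {...} span in the text,
  // skipping markdown fences or prose a small local model added.
  final start = raw.indexOf('{');
  final end = raw.lastIndexOf('}');
  if (start < 0 || end <= start) return null;
  try {
    return jsonDecode(raw.substring(start, end + 1))
        as Map<String, dynamic>;
  } on FormatException {
    return null;
  }
}
```

The null return matters: at the flow layer, a parse failure should surface as a typed error (and potentially trigger a retry or a cloud fallback), not as garbage data.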
Then we'll get into the routing layer: the actual decision of whether a given call goes to the cloud or stays local. Routing on connectivity is the obvious case, but the interesting ones are routing on task complexity (a classifier flow that itself runs on-device through genkit_flutter_gemma), routing on data sensitivity (PII never leaves the device), and routing on latency budget (sub-200ms responses must stay local). We'll also cover fallback chains for when the cloud is unreachable or the on-device model runs out of memory, and how to make the fallback transparent to the calling flow without baking provider-specific logic into business code.
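The routing signals listed above compose into a small policy function. A concept sketch in plain Dart (the types ModelBackend, Request, and route are illustrative, not part of any package): sensitivity and latency pin a request local unconditionally; otherwise prefer the cloud when reachable, with local as the fallback.

```dart
/// The minimal contract both backends implement.
abstract class ModelBackend {
  Future<String> generate(String prompt);
}

class CloudBackend implements ModelBackend {
  @override
  Future<String> generate(String prompt) async => 'cloud: $prompt';
}

class LocalBackend implements ModelBackend {
  @override
  Future<String> generate(String prompt) async => 'local: $prompt';
}

/// Per-request routing signals.
class Request {
  final String prompt;
  final bool containsPii;       // data-sensitivity signal
  final bool online;            // connectivity signal
  final Duration latencyBudget; // latency signal
  Request(this.prompt,
      {this.containsPii = false,
      this.online = true,
      this.latencyBudget = const Duration(seconds: 2)});
}

/// Routing policy. Hard constraints (privacy, latency) are checked
/// before the soft preference for cloud quality.
ModelBackend route(Request r, CloudBackend cloud, LocalBackend local) {
  if (r.containsPii) return local; // PII never leaves the device
  if (r.latencyBudget < const Duration(milliseconds: 200)) return local;
  return r.online ? cloud : local; // connectivity fallback
}
```

Keeping the policy in one pure function like this is what lets business code stay provider-agnostic: flows call `route(...)` and never mention a concrete backend.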
We'll also walk through context and session management across the hybrid boundary: how a conversation that started on-device through flutter_gemma can hand off to the cloud mid-thread without losing state, how token budgets reconcile when one side has a 128K context window and the other has 8K, and what observability looks like when a single user request fans out across both runtimes. Genkit's Developer UI gets concrete here: tracing a hybrid Dart flow end-to-end, seeing which steps ran where, and using local evals to verify that the local path doesn't quietly degrade quality.
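One piece of the 128K-vs-8K reconciliation can be sketched concretely: when a thread hands off to the smaller-window model, trim history from the oldest end so the most recent turns still fit. Plain Dart again, and note the loud assumption: `roughTokens` is a crude whitespace word count standing in for the target model's real tokenizer.

```dart
/// One conversation turn.
class Turn {
  final String role;
  final String text;
  Turn(this.role, this.text);
}

/// Stand-in token estimate. A real implementation would use the
/// target model's tokenizer; word count only illustrates the shape.
int roughTokens(String s) => s.trim().split(RegExp(r'\s+')).length;

/// Keep the most recent turns that fit the target window, so a thread
/// started against a 128K cloud model can hand off to an 8K local
/// model without overflowing its context.
List<Turn> fitToWindow(List<Turn> history, int maxTokens) {
  final kept = <Turn>[];
  var used = 0;
  for (final t in history.reversed) {
    final n = roughTokens(t.text);
    if (used + n > maxTokens) break; // oldest turns are dropped first
    kept.insert(0, t);
    used += n;
  }
  return kept;
}
```

Truncation is the bluntest option; summarizing the dropped prefix into a single synthetic turn is a common refinement, at the cost of an extra model call during handoff.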
You'll leave with a working mental model of how Genkit, the genkit Dart package, flutter_gemma, and genkit_flutter_gemma fit together into a single hybrid inference surface, the architectural patterns that hold up in production, and a clear sense of where the abstraction leaks and how to manage it.
Best fit for engineers building AI features that must work offline, respect privacy constraints, or stay within a cost budget; Flutter and Dart developers shipping AI on-device; and architects deciding whether to invest in a hybrid pipeline or stay cloud-only.
Sasha Denisov
Brainform.ai, CTO, Cloud.AI, Flutter, Dart and Firebase GDE
Berlin, Germany