Gemma 4 ecosystem ships 60+ on-device demos and local agent benchmarks
A weekend of Gemma 4 demos spanned YC hackathon projects, offline iPhone runs, and HN reports of strong local coding and SQL-agent performance. Gemma 4 is increasingly showing up as a practical edge model for tool use and multimodal apps, not just a release benchmark.

TL;DR
- DynamicWebPaige's hackathon post said more than 60 teams demoed Gemma 4 and Cactus Compute projects at Y Combinator, with voice agents, smart glasses, and coding assistants all pitched as running on-device.
- According to rohanpaul_ai's iPhone demo post, Gemma 4's edge-optimized E2B and E4B variants can run fully offline on an iPhone through apps like Locally AI or Google AI Edge Gallery, after downloading a roughly 1.5 GB quantized model.
- The HN discussion roundup surfaced the part engineers actually care about: tool calling and multimodal support, about 40 tokens per second for 26B-A4B in one code-agent-style setup, and surprisingly decent SQL-agent scores from E2B and E4B.
- A few days after launch, vllm_project's release note post was already calling out Gemma 4 stability work in vLLM v0.19.1, while osanseviero's SF recap showed meetups spanning Unsloth, MLX, vLLM, Ollama, SGLang, and NVIDIA.
You can browse the official Gemma 4 page, watch the YC demo reel from DynamicWebPaige, and dig through the main HN thread where commenters got specific about runtimes, code-agent speed, and SQL evals. The weirdly useful bit is how fast the ecosystem moved: DynamicWebPaige's hackathon line photo showed a packed voice-agent event, onusoz's OpenClaw thread was already benchmarking model formats and hot-swapping local models, and a LocalLLaMA post about broken function calling showed the rough edges arriving just as fast.
YC demos
The YC weekend looked less like a model launch afterparty and more like a deployment sprint. DynamicWebPaige's thread opener listed podcast control, multimodal glasses, Codex and Claude Code narrators, and other projects that were all framed as 100 percent on-device.
Four concrete demos stood out in DynamicWebPaige's follow-up thread and the grand-prize update:
- Meeting notes to website edits: transcribe and summarize notes with Gemma 4, extract action items, then use the Gemini API to change a GitHub repo (meeting-notes demo post).
- PodDJ: navigate and jump inside podcasts with function calling, including natural-language seeks like "go to the part where they start talking about technical stuff" (PodDJ mention in the thread).
- FireSite: use Gemma on Ray-Ban glasses for photos, audio capture, transcription, and a generated PDF safety report for firefighters (FireSite post).
- Lookout: review security camera footage locally and let guards query footage in natural language, which DynamicWebPaige's winner post said took the grand prize.
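The PodDJ-style flow above is ordinary function calling: the model emits a structured tool call, and a thin dispatcher maps it onto app code. A minimal sketch, with a hypothetical `seek_to_topic` tool and made-up chapter data (none of this is from the actual demo):

```python
import json

# Hypothetical chapter index for one podcast episode (illustrative data).
CHAPTERS = [
    {"title": "Intro", "start_s": 0},
    {"title": "Technical deep dive", "start_s": 1240},
    {"title": "Q&A", "start_s": 3100},
]

def seek_to_topic(query: str) -> int:
    """Return the start offset (seconds) of the first chapter matching the query."""
    for chapter in CHAPTERS:
        if any(word in chapter["title"].lower() for word in query.lower().split()):
            return chapter["start_s"]
    return 0

def dispatch(tool_call_json: str) -> int:
    """Parse a model-emitted tool call and route it to the Python function."""
    call = json.loads(tool_call_json)
    if call["name"] == "seek_to_topic":
        return seek_to_topic(call["arguments"]["query"])
    raise ValueError(f"unknown tool: {call['name']}")

# For "go to the part where they start talking about technical stuff",
# a compliant model might emit a tool call like this:
model_output = '{"name": "seek_to_topic", "arguments": {"query": "technical stuff"}}'
print(dispatch(model_output))  # → 1240
```

The point is how little glue is needed once the model reliably emits well-formed calls, which is exactly the property the later failure reports put in question.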
Phones and edge runtimes
The practical pitch for Gemma 4 is that it is small enough to carry around. rohanpaul_ai's post said the E2B and E4B variants run fully offline on iPhone through Locally AI or Google AI Edge Gallery, with inference handled on Apple Neural Engine.
A launch-party slide put numbers behind that positioning. It listed E2B, E4B, and 26B A4B, plus a claim of 25 million plus Hugging Face downloads in the first week and 4 million plus downloads on AI Edge Gallery.
The official Gemma 4 page makes the same bet explicit: E2B and E4B are the mobile and IoT-sized variants, while 26B A4B and 31B target personal-computer class inference with multimodal reasoning, function calling, and 140 plus languages.
Local agent harnesses
HN thread: "Google releases Gemma 4 open models" (1.8k upvotes, 474 comments)
The most interesting HN comments were not about leaderboard screenshots. In the main thread, one highlighted comment said Gemma 4 shipped with thinking, multimodality, tool calling, and community quants, while another said the 26B-A4B variant was "head and shoulders above" recent open-weight models in a Claude Code-like harness at roughly 40 tokens per second on an M1 Max setup (HN discussion highlights).
That thread also surfaced a small-model result that matters for agent builders. The same HN roundup quoted an independent SQL benchmark where Gemma-4-E4B-it scored 15 out of 25 and Gemma-4-E2B scored 12 out of 25.
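The roundup does not say how that benchmark scored queries, but a common approach for SQL-agent evals is execution accuracy: run the model's SQL and the gold SQL against the same database and compare result sets. A minimal sketch of that idea, using sqlite and invented test cases (the scoring scheme and data are assumptions, not the benchmark's actual harness):

```python
import sqlite3

def score_sql(cases):
    """Execution-accuracy scoring: count cases where the model-generated SQL
    returns the same result set as the gold SQL on the same database."""
    passed = 0
    for setup_sql, generated, gold in cases:
        conn = sqlite3.connect(":memory:")
        conn.executescript(setup_sql)
        try:
            got = sorted(conn.execute(generated).fetchall())
        except sqlite3.Error:
            continue  # malformed SQL counts as a failure
        if got == sorted(conn.execute(gold).fetchall()):
            passed += 1
    return passed

cases = [
    (
        "CREATE TABLE users(id INT, name TEXT);"
        "INSERT INTO users VALUES (1,'ada'),(2,'bob');",
        "SELECT name FROM users WHERE id = 2",   # model output: correct
        "SELECT name FROM users WHERE id = 2",   # gold query
    ),
    (
        "CREATE TABLE t(x INT); INSERT INTO t VALUES (1),(2),(3);",
        "SELECT SUM(x) FROM t WHERE x > 5",      # model output: wrong predicate
        "SELECT SUM(x) FROM t",
    ),
]
print(score_sql(cases))  # → 1
```

Under this kind of scoring, 15/25 for a phone-sized model is a result about end-to-end query correctness, not just syntax.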
By the weekend, people were already treating Gemma as a swappable local worker model. onusoz's OpenClaw post described a /model vllm/gemma-e4b flow, automatic memory loading, insufficient-memory failures, and a benchmark plan across vLLM, llama-swap, LM Studio, and Ollama.
That same post included a format table for the E4B model, spanning original safetensors, several GGUF variants, and MLX builds. The table is the most honest state-of-the-stack artifact in this story: local Gemma usage is already about runtime and quantization choices as much as the base model itself.
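The insufficient-memory failures the post describes come down to a simple constraint: the chosen build plus KV-cache headroom has to fit in free RAM. A toy sketch of that selection logic, with invented file sizes and a hypothetical headroom factor (none of these figures come from the post's table):

```python
# Illustrative on-disk sizes in GB for hypothetical E4B builds.
# Real sizes vary by quant level and runtime; these are placeholders.
FORMATS = {
    "safetensors-bf16": 8.0,
    "gguf-q8_0": 4.5,
    "gguf-q4_k_m": 2.6,
    "mlx-4bit": 2.5,
}

def pick_format(free_ram_gb: float, headroom: float = 1.3):
    """Pick the largest build that fits in free RAM, with a multiplicative
    headroom factor to leave room for KV cache and activations."""
    candidates = {k: v for k, v in FORMATS.items() if v * headroom <= free_ram_gb}
    if not candidates:
        return None  # insufficient memory, the failure mode the post hit
    return max(candidates, key=candidates.get)

print(pick_format(4.0))   # → gguf-q4_k_m
print(pick_format(2.0))   # → None
```

Every local runtime in the post's benchmark plan (vLLM, llama-swap, LM Studio, Ollama) makes some version of this tradeoff, just with different defaults.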
Function calling is not uniform
LocalLLaMA thread: "gemma4:26b function calling not working" (1 comment)
The early usage reports are not all smooth. In a LocalLLaMA post, one user said gemma4:31b-cloud worked well with Claude Code, while a local gemma4:26b setup ignored commands, skipped tools and MCP, and failed basic project exploration.
Related thread: "Has anyone tried the new Gemma family from Google? 'Frontier models running locally'" (0 comments)
That complaint landed next to a second community signal: an OpenClaw post explicitly framed Gemma as a possible way out of expensive Claude-in-the-cloud agent runs. The gap between those two posts is the current Gemma 4 story in one shot: strong demand for local agents, uneven tool behavior in real harnesses.
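The "ignored commands, skipped tools" failure mode is cheap to detect before wiring a model into a harness: check whether raw output actually contains a well-formed tool call, or just prose. A minimal smoke-test sketch, assuming a generic JSON tool-call convention (specific harnesses use their own formats):

```python
import json
import re

def extract_tool_call(raw: str):
    """Return the first well-formed {"name": ..., "arguments": {...}} object
    embedded in model output, or None if the model answered in prose."""
    for match in re.finditer(r"\{.*\}", raw, re.DOTALL):
        try:
            obj = json.loads(match.group())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj and isinstance(obj.get("arguments"), dict):
            return obj
    return None

# A compliant response embeds a call; a non-compliant one answers in prose,
# which is the failure mode the LocalLLaMA post describes for gemma4:26b.
good = 'Sure. {"name": "list_files", "arguments": {"path": "."}}'
bad = "The project contains a src directory and a README."
print(extract_tool_call(good)["name"])  # → list_files
print(extract_tool_call(bad))           # → None
```

Running a handful of prompts through a check like this per quant and per runtime would make "function calling not working" reports reproducible rather than anecdotal.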
Runtime support is already moving
Infrastructure support started shifting almost immediately. vllm_project's release note post said vLLM v0.19.1 added Gemma 4 stability work as part of a broader Transformers jump to v5.5.4, which is a useful tell that serving Gemma 4 still needed fast-follow fixes.
The community footprint is already much wider than a single Google release page. osanseviero's SF recap counted an SF meetup, the YC hackathon, an Ollama meetup with SGLang, and an NVIDIA developer meetup in the same week, while DynamicWebPaige's launch-party post showed vLLM, Unsloth, Ollama, Hugging Face, Cactus Compute, Apple, NVIDIA, and PyTorch all in the room.
The last new fact is the easiest one to miss: DynamicWebPaige's winner post said the grand-prize demo, Lookout, was built in Rust. Christmas came early for local-agent nerds, but it also came with a reminder that the interesting work is already happening one layer below the model card.