Users published reproducible 16 GB VRAM and Apple Silicon setups for the Gemma 4 26B-A4B and 31B variants. Google’s AI Gallery app also brought offline Gemma chat to phones. The setups make local coding and vision work more practical, but runtime choice, quantization, and recent llama.cpp regressions still affect reliability.

Google shipped the official weights, vLLM added day-0 support, Cloudflare added a hosted 26B-A4B endpoint, and the AI Edge Gallery app updated to feature Gemma 4 on phones. The weirdly useful part is how fast the community moved from launch benchmarks to reproducible local configs, complete with image-token knobs, llama.cpp version warnings, and arguments about which harness wastes the least model IQ.
Gemma 4 — Google DeepMind
1.8k upvotes · 462 comments
The official pitch is broad, but the community converged fast on one practical target: the 26B-A4B mixture-of-experts model for a 16 GB card. The most detailed config writeup favored an Unsloth GGUF, low temperature, low top-k, and --image-min-tokens 300, with the claim that vision quality jumps noticeably once that floor is raised.
That same post is unusually concrete about the tradeoffs around quantization, sampling settings, and image-token budget.
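A minimal sketch of that kind of 16 GB setup, assuming a recent llama.cpp build with multimodal support. The file names and the -ngl/-c values are placeholders, not from the writeup; the sampling flags and --image-min-tokens are the knobs the post discusses:

```shell
# Sketch only: community-style llama-server config for 26B-A4B on a 16 GB card.
# Model and mmproj file names below are placeholders for an Unsloth GGUF quant.
# Low temperature and low top-k per the writeup; --image-min-tokens 300 raises
# the image-token floor, which the post claims noticeably improves vision quality
# (flag requires a recent llama.cpp build).
llama-server \
  -m gemma-4-26b-a4b-it-Q4_K_M.gguf \
  --mmproj mmproj-gemma-4-26b-a4b.gguf \
  --temp 0.1 \
  --top-k 20 \
  --image-min-tokens 300 \
  -ngl 99 \
  -c 37000
```

Given the llama.cpp regressions mentioned above, pinning a known-good build is probably as important as any of these flags.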
The Hacker News release thread pushed in the same direction, surfacing immediate tuning notes from Unsloth plus reports of the 26B-A4B running in a Claude Code-style harness on an M1 Max at roughly 40 tok/s with 37K context.
Hrishi O's field report is the clearest explanation for why early Gemma 4 takes feel inconsistent. The claim is not that the weights are weak. It is that wrapper behavior now matters enough to swamp the underlying model, especially for tool use and interleaved reasoning.
The rough consensus from community testing looks like this:
codex --oss handled Gemma 4 best in agentic workflows.

The blind eval post adds the missing caveat. Gemma 4 31B matched Qwen 3.5 27B on average score in that small run, and the 26B-A4B matched the 31B when it worked, but reliability was shakier. The MoE variant failed two prompts outright, while the 31B paid for its quality with occasional multi-minute generations.
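As a sketch, the harness setup the consensus points at would look something like this. The --oss flag is the one the community thread references; the model tag is an assumption and depends on what your local runtime serves:

```shell
# Hypothetical invocation: point Codex CLI at a locally served open model.
# The model name is a placeholder; substitute whatever tag your local
# backend exposes for the Gemma 4 26B-A4B weights.
codex --oss -m gemma-4-26b-a4b
```

The field report's point stands either way: results depend heavily on how the harness formats tool calls and interleaved reasoning, not just on the weights behind the flag.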
Google did not just ship model cards. The Google Developers post explicitly pitched Gemma 4 for on-device agents, offline code generation, and Android edge deployment. The AI Gallery thread made that tangible by pointing to Google's own open source mobile app, available on iOS, Android, and GitHub.
Hosted runtimes moved just as fast. Cloudflare's changelog says Workers AI added @cf/google/gemma-4-26b-a4b-it on April 4, and vLLM's launch note says support landed on day zero across Google TPUs, AMD GPUs, and Intel XPUs. That split is probably the real release story: Gemma 4 arrived as a model family, but it immediately turned into a packaging race across phones, laptops, agent harnesses, and hosted inference stacks.
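For the hosted path, a request against the Workers AI endpoint would be a sketch along these lines. The account ID and token are placeholders, and the exact request schema is an assumption; the model ID is the one Cloudflare's changelog lists:

```shell
# Sketch: calling the hosted 26B-A4B endpoint via the Workers AI REST API.
# $CF_ACCOUNT_ID and $CF_API_TOKEN are placeholders for your own credentials.
curl "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/ai/run/@cf/google/gemma-4-26b-a4b-it" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Summarize the Gemma 4 release in one sentence."}]}'
```

On self-hosted vLLM, the equivalent would be a vllm serve invocation against the Hugging Face model ID, assuming the day-0 support the launch note describes.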