Skip to content
AI Primer
workflow

Gemma 4 26B-A4B runs at 30K context on 16 GB VRAM in community configs

Users published reproducible 16 GB VRAM and Apple Silicon setups for the Gemma 4 26B-A4B and 31B variants. Google’s AI Gallery app also brought offline Gemma chat to phones. The setups make local coding and vision work more practical, but runtime choice, quantization, and recent llama.cpp regressions still affect reliability.

4 min read
Gemma 4 26B-A4B runs at 30K context on 16 GB VRAM in community configs
Gemma 4 26B-A4B runs at 30K context on 16 GB VRAM in community configs

TL;DR

  • Google's launch materials positioned Gemma 4 as a four-model open family, with E2B and E4B for phones and edge devices, plus a 26B-A4B MoE and 31B dense model for PCs, all under Apache 2.0. The official Google blog post and product page both frame the release around multimodality, long context, and agentic workflows.
  • In a high-signal LocalLLaMA setup post, one user said the 26B-A4B quant could keep vision, fit 30K plus tokens in 16 GB VRAM, and respond at 80 tps plus with tuned sampling and image-token settings.
  • A separate 30-question community eval found the 31B and 26B-A4B models could score near Qwen 3.5 27B on average, but the MoE variant errored out twice and the 31B sometimes took five minutes to finish a response.
  • According to hrishioa's harness notes, Gemma 4's practical quality swings hard with the wrapper around it: Codex worked best in testing, Pi was decent, while Claude Code and OpenCode stumbled on prompting and tool calls.
  • A widely shared phone demo and Google's AI Edge Gallery repo show the smaller Gemma 4 models already running fully offline on iPhone and Android, while Cloudflare's rollout put the 26B-A4B model on Workers AI the same day.

Google shipped the official weights, vLLM added day-0 support, Cloudflare added a hosted 26B-A4B endpoint, and the AI Edge Gallery app updated to feature Gemma 4 on phones. The weirdly useful part is how fast the community moved from launch benchmarks to reproducible local configs, complete with image-token knobs, llama.cpp version warnings, and arguments about which harness wastes the least model IQ.

Gemma 4's local sweet spot is the 26B-A4B

Y
Hacker News

Gemma 4 — Google DeepMind

1.8k upvotes · 462 comments

r/LocalLLaMA

Gemma 4 for 16 GB VRAM

14 comments

The official pitch is broad, but the community converged fast on one practical target: the 26B-A4B mixture-of-experts model for a 16 GB card. The most detailed config writeup favored an Unsloth GGUF, low temperature, low top-k, and --image-min-tokens 300, with the claim that vision quality jumps noticeably once that floor is raised.

That same post is unusually concrete about the tradeoffs:

  • keep vision if you can stay around 30K context in KV fp16
  • drop vision before switching to worse KV cache settings
  • use recent llama.cpp builds, but avoid post-b8660 builds for now because of a tokenizer regression
  • expect much higher throughput than a locally run Qwen 3.5 27B on the same hardware

The release thread on Hacker News pushed in the same direction. HN discussion highlights surfaced immediate tuning notes from Unsloth, plus reports of the 26B-A4B running in a Claude Code-style harness on an M1 Max at roughly 40 tok/s with 37K context.

The model is ahead of the harnesses

r/LocalLLaMA

Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B — 30-question blind eval with Claude Opus 4.6 as judge

21 comments

Hrishi O's field report is the clearest explanation for why early Gemma 4 takes feel inconsistent. The claim is not that the weights are weak. It is that wrapper behavior now matters enough to swamp the underlying model, especially for tool use and interleaved reasoning.

The rough consensus from community testing looks like this:

  • Codex and codex --oss handled Gemma 4 best in agentic workflows
  • Pi worked reasonably well, but extensions could confuse the model
  • Claude Code's system prompt felt too heavy for Gemma's reasoning format
  • OpenCode had weaker prompt and tool-call behavior for this family
  • Q4 quantization was usable, but some testers wanted Q8 or better for serious data work

The blind eval post adds the missing caveat. Gemma 4 31B matched Qwen 3.5 27B on average score in that small run, and the 26B-A4B matched the 31B when it worked, but reliability was shakier. The MoE variant failed two prompts outright, while the 31B paid for its quality with occasional multi-minute generations.

Phones and edge runtimes showed up immediately

Google did not just ship model cards. The Google Developers post explicitly pitched Gemma 4 for on-device agents, offline code generation, and Android edge deployment. The AI Gallery thread made that tangible by pointing to Google's own open source mobile app, available on iOS, Android, and GitHub.

Hosted runtimes moved just as fast. Cloudflare's changelog says Workers AI added @cf/google/gemma-4-26b-a4b-it on April 4, and vLLM's launch note says support landed on day zero across Google TPUs, AMD GPUs, and Intel XPUs. That split is probably the real release story: Gemma 4 arrived as a model family, but it immediately turned into a packaging race across phones, laptops, agent harnesses, and hosted inference stacks.

Further reading

Discussion across the web

Where this story is being discussed, in original context.

Share on X