Gemma 4 12B releases with 256K context and unified audio-vision input
Google’s new Gemma 4 12B ships as an encoder-free open model for text, image, audio, and video tasks with a 256K context window. Early GGUF ports and local benchmarks make it a plausible on-device multimodal option for creator tooling and experimentation.

TL;DR
- Google shipped Gemma 4 12B as a unified multimodal open model, and GoogleDeepMind's launch repost frames it as an encoder-free system built for laptop-class hardware.
- The launch adds native audio input to Gemma's mid-sized tier, while DavidmComfort's Google repost describes it as an open model for agentic reasoning, vision, and audio.
- Early community ports moved fast: stevibe's Hugging Face link pointed to a day-one GGUF build from Unsloth, and the official Hugging Face weights put the model card on the public record.
- First local tests were already posting usable numbers, with Everlier's Strix Halo run reporting 9.93 tokens per second and stevibe's DGX Spark benchmark showing 15.22 to 25.21 tokens per second across quants.
Google's launch post says the model is meant to sit between E4B and the 26B MoE, while the developer guide adds a local llama.cpp and OpenCode setup. The official model card lists text, image, audio, and video inputs, and Unsloth's local run guide pegs 12B Unified at roughly 7 to 8 GB in 4-bit GGUF form. You can also jump straight to the Unsloth GGUF repo, which went live essentially alongside the launch.
Unified multimodal stack
Google's announcement positions 12B as the missing middle of the Gemma 4 line: larger than E4B, smaller than 26B A4B, and still aimed at local use. The same post says the model keeps text, image, audio, and video in one encoder-free stack, ships under Apache 2.0, and targets 16 GB of VRAM or unified memory.
The more useful detail is in Google's developer guide, which says multimodal inputs go straight into the LLM backbone instead of passing through separate vision and audio encoders. The guide also describes 12B as the first medium-sized Gemma model with native audio input.
The official model card adds the practical shape of the release:
- input types: text, image, audio, and video
- output type: text
- variants: pre-trained and instruction-tuned weights
- positioning: local deployment on consumer devices
Local ports landed immediately
The creator-side story here is speed of packaging. Within hours, stevibe's Hugging Face link was already pointing at Unsloth's GGUF conversion, and Google's launch post says day-one support spans Hugging Face, Kaggle, Ollama, LM Studio, llama.cpp, MLX, vLLM, SGLang, and Unsloth.
Unsloth's hardware table is more concrete than the marketing line. It lists 12B Unified at about 7 to 8 GB total memory in 4-bit GGUF, 13 to 14 GB in 8-bit, and about 25 GB in FP16 or BF16.
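Those figures can be sanity-checked with back-of-envelope arithmetic. The sketch below is not Unsloth's method. It assumes a nominal 12 billion parameters and a uniform bit width per weight, and it counts only raw weight storage, ignoring KV cache, activations, and runtime overhead. Real GGUF quants mix precisions across tensors, so treat the output as a floor rather than a prediction.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (10^9 bytes).

    Assumption: every weight uses the same number of bits. KV cache,
    activations, and runtime overhead are deliberately left out.
    """
    return params_billion * bits_per_weight / 8


# Nominal 12B parameter count (an assumption taken from the model name).
NOMINAL_PARAMS_BILLION = 12

for label, bits in [("4-bit", 4), ("8-bit", 8), ("16-bit", 16)]:
    gb = weight_memory_gb(NOMINAL_PARAMS_BILLION, bits)
    print(f"{label}: about {gb:.0f} GB of weights")
```

The weight-only estimates come out at 6, 12, and 24 GB, each a little under the corresponding figure in Unsloth's table, which is what you would expect if the table also budgets for overhead beyond the weights.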
Google's developer guide also shows the model wired into OpenCode with llama.cpp, which matters if your workflow is less chatbot and more local agent harness. Everlier's Strix Halo run then supplied the first rough field check: a Q8 XL Unsloth quant on a Ryzen AI Max+ 395 system, served through llama-server with --kv-unified, produced 2,033 tokens in 3 minutes 24 seconds.
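The token count and elapsed time from that run convert to a throughput figure with simple division. The helper below is a hypothetical illustration, not part of any benchmark tool. It treats the full 3 minutes 24 seconds as generation time, which is an assumption, since the post does not say how the time was measured.

```python
def tokens_per_second(tokens: int, minutes: int, seconds: int) -> float:
    """Average throughput over a run, assuming the whole elapsed time was spent generating."""
    elapsed_seconds = minutes * 60 + seconds
    return tokens / elapsed_seconds


# Figures quoted from the Strix Halo run: 2,033 tokens in 3 min 24 s.
rate = tokens_per_second(2033, 3, 24)
print(f"{rate:.2f} tok/s")
```

The result is within a few hundredths of the 9.93 tokens per second reported for the same run. The small gap may come from rounding of the elapsed time or from how the server itself measures generation speed.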
Early speed checks
The first quant table was simple and useful. stevibe's benchmark reported four Unsloth quants on DGX Spark, with throughput in tokens per second and time to first token (TTFT):
- UD_Q4_K_XL: 25.21 tok/s, 168 ms TTFT
- UD_Q5_K_XL: 21.7 tok/s, 182 ms TTFT
- UD_Q6_K_XL: 17.68 tok/s, 193.95 ms TTFT
- UD_Q8_K_XL: 15.22 tok/s, 221 ms TTFT
That lines up with the release's actual appeal: not frontier-size bragging rights, but a multimodal model small enough that people started treating it like a local component immediately.
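To read the quant table in terms of waiting time, you can combine TTFT and throughput into a rough per-response estimate. This is a simplified model, not something stevibe published. It assumes decode speed stays constant for the whole response and that total time is TTFT plus tokens divided by throughput. The 500-token response length is an arbitrary illustrative choice.

```python
# (tokens per second, TTFT in ms) as reported in the DGX Spark quant table.
BENCH = {
    "UD_Q4_K_XL": (25.21, 168.0),
    "UD_Q5_K_XL": (21.7, 182.0),
    "UD_Q6_K_XL": (17.68, 193.95),
    "UD_Q8_K_XL": (15.22, 221.0),
}


def response_seconds(tok_per_s: float, ttft_ms: float, n_tokens: int) -> float:
    """Rough wall-clock estimate: TTFT plus steady-state decode time.

    Assumption: throughput is constant across the response.
    """
    return ttft_ms / 1000 + n_tokens / tok_per_s


N_TOKENS = 500  # hypothetical response length, chosen for illustration

for quant, (tps, ttft) in BENCH.items():
    print(f"{quant}: ~{response_seconds(tps, ttft, N_TOKENS):.1f} s for {N_TOKENS} tokens")

# Relative decode speed of the smallest quant versus the largest.
speedup = BENCH["UD_Q4_K_XL"][0] / BENCH["UD_Q8_K_XL"][0]
print(f"Q4 vs Q8 throughput ratio: {speedup:.2f}x")
```

Under this model TTFT is a small share of total time for responses of a few hundred tokens, so the throughput column dominates the comparison between quants.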
Koala microevals
Everlier's koala microeval compared the new 12B dense model against Gemma 4's 26B MoE and 31B dense variants. The post did not turn that into a grand claim, but it does place 12B in the part of the lineup where people will inevitably ask whether the unified architecture trades too much away for the smaller footprint.
A second note from Everlier is the more interesting one for multimodal builders. Everlier said the result was notable specifically because visual comprehension is part of the main transformer stack, which is exactly the architectural bet Google is making in the official docs.