Release, June 3, 2026

Gemma 4 12B releases with 256K context and unified audio-vision input

Google’s new Gemma 4 12B ships as an encoder-free open model for text, image, audio, and video tasks with a 256K context window. Early GGUF ports and local benchmarks make it a plausible on-device multimodal option for creator tooling and experimentation.



TL;DR

Google shipped Gemma 4 12B as a unified multimodal open model, and GoogleDeepMind's launch repost frames it as an encoder-free system built for laptop-class hardware.
The launch adds native audio input to Gemma's mid-sized tier, while DavidmComfort's Google repost describes it as an open model for agentic reasoning, vision, and audio.
Early community ports moved fast: stevibe's Hugging Face link pointed to a day-one GGUF build from Unsloth, and the official Hugging Face weights put the model card on the public record.
First local tests were already posting usable numbers, with Everlier's Strix Halo run reporting 9.93 tokens per second and stevibe's DGX Spark benchmark showing 15.22 to 25.21 tokens per second across quants.

Google's launch post says the model is meant to sit between E4B and the 26B MoE, while the developer guide adds a local llama.cpp and OpenCode setup. The official model card lists text, image, audio, and video inputs, and Unsloth's local run guide pegs 12B Unified at roughly 7 to 8 GB in 4-bit GGUF form. You can also jump straight to the Unsloth GGUF repo, which went live essentially alongside the launch.

Unified multimodal stack

Google's announcement positions 12B as the missing middle of the Gemma 4 line: larger than E4B, smaller than 26B A4B, and still aimed at local use. The same post says the model keeps text, image, audio, and video in one encoder-free stack, ships under Apache 2.0, and targets 16 GB of VRAM or unified memory.

The more useful detail is in Google's developer guide, which says multimodal inputs go straight into the LLM backbone instead of passing through separate vision and audio encoders. That guide also makes 12B the first medium-sized Gemma model with native audio input.

The official model card adds the practical shape of the release:

input types: text, image, audio, and video
output type: text
variants: pre-trained and instruction-tuned weights
positioning: local deployment on consumer devices

Local ports landed immediately

The creator-side story here is speed of packaging. Within hours, stevibe's Hugging Face link was already pointing at Unsloth's GGUF conversion, and Google's launch post says day-one support spans Hugging Face, Kaggle, Ollama, LM Studio, llama.cpp, MLX, vLLM, SGLang, and Unsloth.

Unsloth's hardware table is more concrete than the marketing line. It lists 12B Unified at about 7 to 8 GB total memory in 4-bit GGUF, 13 to 14 GB in 8-bit, and about 25 GB in FP16 or BF16.
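A rough back-of-the-envelope check shows why those figures are plausible. The sketch below computes a weights-only footprint as parameters times bits per weight, assuming a nominal 12 billion parameters (implied by the "12B" name, not confirmed here) and exactly 4, 8, or 16 bits per weight. Real GGUF quants mix precisions and add KV cache and runtime overhead, which is likely why Unsloth's numbers sit a GB or two above these floors.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in decimal GB: params * bits / 8 bytes."""
    return params_billion * bits_per_weight / 8.0


# Assumption: nominal parameter count implied by the "12B" name.
NOMINAL_PARAMS_B = 12.0

# Assumption: idealized bit widths; real quant formats average slightly more.
estimates = {
    "4-bit": weight_memory_gb(NOMINAL_PARAMS_B, 4),
    "8-bit": weight_memory_gb(NOMINAL_PARAMS_B, 8),
    "16-bit": weight_memory_gb(NOMINAL_PARAMS_B, 16),
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB of weights before KV cache and runtime overhead")
```

Read against the table, the 4-bit floor of about 6 GB leaves 1 to 2 GB of headroom inside the quoted 7 to 8 GB, and the same pattern holds for the 8-bit and 16-bit rows.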

Google's developer guide also shows the model wired into OpenCode with llama.cpp, which matters if your workflow is less chatbot and more local agent harness. Everlier's Strix Halo run then supplied the first rough field check: a Q8 XL Unsloth quant on a Ryzen AI 395+ system, served through llama-server with --kv-unified, produced 2,033 tokens in 3 minutes 24 seconds.
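The Strix Halo figure can be sanity-checked with simple arithmetic. This minimal sketch divides the reported token count by the reported wall time; the helper name is hypothetical, and it assumes the 3 minutes 24 seconds covers the whole generation.

```python
def tokens_per_second(tokens: int, minutes: int, seconds: int) -> float:
    """Average throughput over a run given its wall-clock duration."""
    elapsed = minutes * 60 + seconds
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return tokens / elapsed


# Figures from Everlier's run: 2,033 tokens in 3 minutes 24 seconds.
rate = tokens_per_second(2033, 3, 24)
print(f"{rate:.2f} tok/s")
```

The result lands just under 10 tokens per second, in line with the 9.93 tok/s quoted for the run; the small gap presumably comes from how the timing window was measured.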

Early speed checks

The first quant table was simple and useful. stevibe's DGX Spark benchmark reported four Unsloth quants:

UD_Q4_K_XL: 25.21 tok/s, 168 ms TTFT
UD_Q5_K_XL: 21.7 tok/s, 182 ms TTFT
UD_Q6_K_XL: 17.68 tok/s, 193.95 ms TTFT
UD_Q8_K_XL: 15.22 tok/s, 221 ms TTFT
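To make the tradeoff in those numbers concrete, the sketch below estimates how long each quant would take to produce a 1,000-token reply. It assumes a simplified model, time to first token plus tokens divided by the reported rate, and treats the reported tok/s as a steady decode speed; neither assumption comes from the benchmark itself.

```python
# Reported DGX Spark results: quant -> (tokens per second, TTFT in ms).
RESULTS = {
    "UD_Q4_K_XL": (25.21, 168.0),
    "UD_Q5_K_XL": (21.7, 182.0),
    "UD_Q6_K_XL": (17.68, 193.95),
    "UD_Q8_K_XL": (15.22, 221.0),
}


def estimated_seconds(tok_per_s: float, ttft_ms: float, n_tokens: int) -> float:
    """Simplified latency model (assumption): TTFT plus steady decode time."""
    return ttft_ms / 1000.0 + n_tokens / tok_per_s


fastest = max(RESULTS, key=lambda name: RESULTS[name][0])
speedup = RESULTS["UD_Q4_K_XL"][0] / RESULTS["UD_Q8_K_XL"][0]

for name, (tps, ttft) in RESULTS.items():
    print(f"{name}: ~{estimated_seconds(tps, ttft, 1000):.1f} s for 1,000 tokens")
print(f"fastest: {fastest}, Q4 vs Q8 decode speedup: {speedup:.2f}x")
```

Under this model, decode rate dominates: the TTFT differences are tens of milliseconds, while the gap between the 4-bit and 8-bit quants over a long reply is tens of seconds.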

That lines up with the release's actual appeal: not frontier-size bragging rights, but a multimodal model small enough that people started treating it like a local component immediately.

Koala microevals

Everlier's koala microeval comparison set the new 12B dense model against Gemma 4's 26B MoE and 31B dense variants. The post did not turn that into a grand claim, but it does place 12B in the part of the lineup where people will inevitably ask whether the unified architecture trades too much away for the smaller footprint.

A second post, Everlier's microeval note, is the more interesting one for multimodal builders. Everlier said the result was notable specifically because visual comprehension is part of the main transformer stack, which is exactly the architectural bet Google describes in the official docs.
