AI Primer

Gemma 4 12B releases with 256K context and unified audio-vision input

Google’s new Gemma 4 12B ships as an encoder-free open model for text, image, audio, and video tasks with a 256K context window. Early GGUF ports and local benchmarks make it a plausible on-device multimodal option for creator tooling and experimentation.

TL;DR

Google's launch post says the model is meant to sit between E4B and the 26B MoE, while the developer guide adds a local llama.cpp and OpenCode setup. The official model card lists text, image, audio, and video inputs, and Unsloth's local run guide pegs 12B Unified at roughly 7 to 8 GB in 4-bit GGUF form. You can also jump straight to the Unsloth GGUF repo, which went live essentially alongside the launch.

Unified multimodal stack

Google's announcement positions 12B as the missing middle of the Gemma 4 line: larger than E4B, smaller than 26B A4B, and still aimed at local use. The same post says the model keeps text, image, audio, and video in one encoder-free stack, ships under Apache 2.0, and targets 16 GB of VRAM or unified memory.

The more useful detail is in Google's developer guide, which says multimodal inputs go straight into the LLM backbone instead of passing through separate vision and audio encoders. That guide also makes 12B the first medium-sized Gemma model with native audio input.

The official model card adds the practical shape of the release:

  • input types: text, image, audio, and video
  • output type: text
  • variants: pre-trained and instruction-tuned weights
  • positioning: local deployment on consumer devices

Local ports landed immediately

The creator-side story here is speed of packaging. Within hours, stevibe's Hugging Face link was already pointing at Unsloth's GGUF conversion, and Google's launch post says day-one support spans Hugging Face, Kaggle, Ollama, LM Studio, llama.cpp, MLX, vLLM, SGLang, and Unsloth.

Unsloth's hardware table is more concrete than the marketing line. It lists 12B Unified at about 7 to 8 GB total memory in 4-bit GGUF, 13 to 14 GB in 8-bit, and about 25 GB in FP16 or BF16.
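Those figures can be sanity-checked with back-of-envelope arithmetic. The sketch below is not Unsloth's method; it assumes a flat 12 billion parameters and decimal gigabytes, and it counts weights only.

```python
# Back-of-envelope weight footprint: parameters * bits per weight / 8 bytes.
# Assumption: a flat 12e9 parameters and decimal gigabytes (1 GB = 1e9 bytes).
PARAMS = 12e9


def weight_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Return the approximate size of the weights alone, in GB."""
    return params * bits_per_weight / 8 / 1e9


for label, bits in [("4-bit", 4), ("8-bit", 8), ("16-bit (FP16/BF16)", 16)]:
    print(f"{label}: ~{weight_gb(bits):.0f} GB of weights")
```

The weights-only numbers land about 1 to 2 GB under each of Unsloth's totals. That gap is plausibly the KV cache, runtime buffers, and tensors kept at higher precision in mixed quants, though the source does not break it down.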

Google's developer guide also shows the model wired into OpenCode with llama.cpp, which matters if your workflow is less chatbot and more local agent harness. Everlier's Strix Halo run then supplied the first rough field check: a Q8 XL Unsloth quant on a Ryzen AI 395+ system, served through llama-server with --kv-unified, produced 2,033 tokens in 3 minutes 24 seconds.
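For comparison with other runs, the Strix Halo figure converts to an average rate with simple arithmetic. This is a sketch that uses only the two numbers reported above; the variable names are illustrative.

```python
# Figures from Everlier's reported run: 2,033 tokens in 3 minutes 24 seconds.
tokens = 2033
elapsed_s = 3 * 60 + 24  # 204 seconds

# Average end-to-end rate over the whole run.
tokens_per_second = tokens / elapsed_s
print(f"{tokens_per_second:.2f} tok/s")
```

That works out to just under 10 tokens per second. The report does not say whether the elapsed time includes prompt processing, so treat this as an end-to-end average and not a pure decode rate.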

Early speed checks

The first quant table was simple and useful. stevibe's benchmark reported speeds for four Unsloth quants on DGX Spark:

  • UD_Q4_K_XL: 25.21 tok/s, 168 ms TTFT
  • UD_Q5_K_XL: 21.7 tok/s, 182 ms TTFT
  • UD_Q6_K_XL: 17.68 tok/s, 193.95 ms TTFT
  • UD_Q8_K_XL: 15.22 tok/s, 221 ms TTFT
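One way to read the table is as an estimated wait for a full response: time to first token plus steady decoding. The sketch below copies the reported figures and assumes decode speed stays constant for the whole response; the 500-token length and the helper name `response_time_s` are illustrative choices, not part of the benchmark.

```python
# Numbers copied from stevibe's reported DGX Spark table: (tok/s, TTFT in ms).
QUANTS = {
    "UD_Q4_K_XL": (25.21, 168.0),
    "UD_Q5_K_XL": (21.7, 182.0),
    "UD_Q6_K_XL": (17.68, 193.95),
    "UD_Q8_K_XL": (15.22, 221.0),
}


def response_time_s(quant: str, n_tokens: int) -> float:
    """Estimate wall-clock seconds: time to first token plus steady decoding.

    Assumption: decode speed stays constant over the whole response.
    """
    tok_s, ttft_ms = QUANTS[quant]
    return ttft_ms / 1000 + n_tokens / tok_s


N_TOKENS = 500  # hypothetical response length, chosen only for illustration
for name in QUANTS:
    print(f"{name}: {response_time_s(name, N_TOKENS):.1f} s for {N_TOKENS} tokens")
```

Under these assumptions the decode rate dominates: the TTFT differences are tens of milliseconds, while the gap between the 4-bit and 8-bit quants grows to many seconds over a few hundred tokens.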

That lines up with the release's actual appeal: not frontier-size bragging rights, but a multimodal model small enough that people started treating it like a local component immediately.

Koala microevals

Everlier's koala microeval compared the new 12B dense model against Gemma 4's 26B MoE and 31B dense variants. The post did not turn that into a grand claim, but it does place 12B in the part of the lineup where people will inevitably ask whether the unified architecture trades too much away for the smaller footprint.

Everlier's second microeval note is the more interesting one for multimodal builders. Everlier said the result was notable specifically because visual comprehension is part of the main transformer stack, which is exactly the architectural bet Google is making in the official docs.
