Google DeepMind shipped four Gemma 4 models with multimodal input: a 31B dense model, a 26B MoE model, and two edge variants, all available through AI Studio, Hugging Face, Kaggle, and Ollama. Early community tests say local performance and usable context windows still vary by runtime, quantization, and GPU memory.

You can read Google's post, jump straight to the Hugging Face technical writeup, and browse the Kaggle release. There is also day-0 support in Google AI Studio and vLLM, plus a long Hacker News thread full of quantization tips and local deployment chatter.
The biggest change is the license. Google says Gemma 4 is released under Apache 2.0, and Hugging Face treated that as launch-day headline material too (Apache 2.0 reaction).
That puts Gemma 4 on the same legal footing as the open model families developers already redistribute, fine-tune, and bundle into products. For a model line that had already reached 400 million downloads and more than 100,000 variants, according to Google's post, that is the part likely to travel furthest.
Google split the family into four tiers: a 31B dense model, a 26B MoE model, and two smaller edge variants.
The company says the 31B model ranks third on Arena AI's open text leaderboard, with the 26B model sixth, and the launch benchmark chart argues both compete with far larger open models. Hugging Face's table matches the context split and notes that all four ship in base and instruction-tuned variants.
Google's own pitch is less chatbot, more local agent. The launch thread calls out app navigation, database search, API triggering, and long action histories, all with native tool use (agentic capabilities).
The Hugging Face post fills in the mechanics, and the multimodal piece is catnip for creative tool builders: Google is shipping a model family that can look at interfaces, emit structured coordinates, and keep enough context around to act on them.
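As a rough sketch of what that UI-grounding workflow could look like, here is a hypothetical call through the transformers image-text-to-text pipeline. The model ID, the screenshot URL, and the JSON-coordinate output convention are all illustrative assumptions, not confirmed Gemma 4 specifics.

```python
# Hypothetical sketch: asking a multimodal model to ground a UI element
# to screen coordinates. Model ID and output format are assumptions.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-4-31b-it",  # assumed checkpoint ID for illustration
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder screenshot URL; swap in a real UI capture.
            {"type": "image", "url": "https://example.com/screenshot.png"},
            {
                "type": "text",
                "text": "Find the 'Submit' button and reply with JSON "
                        'like {"x": 0, "y": 0} in pixel coordinates.',
            },
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64)
# The last message in the returned chat is the model's reply,
# which we hope is a small JSON coordinate blob.
print(out[0]["generated_text"][-1]["content"])
```

Whether the weights actually emit clean coordinates like this is exactly the kind of thing the early community benchmarking below will settle.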
Distribution was broad on day one. Google pushed Gemma 4 into AI Studio, while also offering weights through Hugging Face, Kaggle, and Ollama (availability post). Hugging Face says it worked with Google and the community to land support across transformers, llama.cpp, MLX, WebGPU, and Mistral.rs, while vLLM added day-0 support across TPUs, AMD GPUs, and Intel XPUs.
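On the vLLM side, a minimal offline-inference sketch might look like the following; the checkpoint ID is an assumption, since exact Hub names were not part of this writeup.

```python
# Minimal vLLM offline-inference sketch, assuming a hypothetical
# "google/gemma-4-26b-it" checkpoint ID.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-4-26b-it")  # assumed model ID
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Summarize the Gemma 4 launch in one sentence."], params
)
print(outputs[0].outputs[0].text)
```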
The first community reports already showed the usual local-model split between theory and runtime reality. One HN commenter from Unsloth shared quantization guidance and sampling settings for local runs, while a Reddit user with a 16GB RTX 5060 Ti said larger Gemma 4 quants only became usable after shrinking context windows, with some landing at 64K or 32K instead of the advertised maximum (Reddit thread). Another commenter in the same thread framed the smaller models as routing and RAG helpers, not frontier replacements.
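For the context-shrinking trick that Reddit commenter described, a local-run sketch with llama-cpp-python might look like this. The GGUF filename is a placeholder, and the 32K context value mirrors the thread's numbers rather than any official recommendation.

```python
# Sketch: trading context length for VRAM headroom on a 16GB GPU.
# The GGUF path is a placeholder; quant level and n_ctx are the knobs
# the community threads discuss.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-26b-q4_k_m.gguf",  # placeholder quantized file
    n_ctx=32768,       # shrink from the advertised max to fit in VRAM
    n_gpu_layers=-1,   # offload every layer that fits onto the GPU
)

out = llm("Q: What changed in the Gemma 4 license?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```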
That makes Gemma 4 feel less like a single model launch and more like a new local stack to benchmark carefully, especially if the interesting use case is multimodal agents on hardware you already own.
Introducing Gemma 4, our series of open weight (Apache 2.0 licensed) models, which are byte for byte the most capable open models in the world! Gemma 4 is built to run on your hardware: phones, laptops, and desktops. Frontier intelligence with a 26B MoE and a 31B dense model!
Meet Gemma 4: our new family of open models you can run on your own hardware. Built for advanced reasoning and agentic workflows, we’re releasing them under an Apache 2.0 license. Here’s what’s new 🧵
Build autonomous agents that plan, navigate apps, and execute multi-step tasks – like searching databases or triggering APIs – with native tool use. With up to 256K context, it can analyze full codebases and retain complex action histories without losing focus.
Gemma 4 is out on Hugging Face blog: huggingface.co/blog/gemma4
Start building with Gemma 4 now in @GoogleAIStudio. You can also download the model weights from @HuggingFace, @Kaggle, or @Ollama. Find out more → goo.gle/41IC3lY