AI Primer

Gemma is Google DeepMind's family of open AI models, including multimodal releases and multiple generations under the Gemma name.

Pricing

Model profile · Current snapshot
Input / 1M tokens: $0.02
Output / 1M tokens: $0.04
Blended / 1M tokens: $0.025
Output TPS: 49.45
TTFT (s): 0.52
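
The page does not say how the blended price is derived. The listed figures are consistent with a 3:1 input-to-output token weighting, so the sketch below assumes that ratio. It also shows how the per-token prices, output TPS, and TTFT could be combined into a rough per-request cost and latency estimate. All function names are hypothetical, and the latency model (TTFT plus steady-state decoding) is a simplification.

```python
# Snapshot values copied from the profile above.
INPUT_PER_M = 0.02    # USD per 1M input tokens
OUTPUT_PER_M = 0.04   # USD per 1M output tokens
OUTPUT_TPS = 49.45    # output tokens per second
TTFT_S = 0.52         # time to first token, in seconds


def blended_price(input_per_m: float, output_per_m: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens.

    The 3:1 default weighting is an assumption; it happens to reproduce
    the listed blended figure but is not stated on the page.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Cost in USD of a single request at per-1M-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


def estimated_latency(output_tokens: int, ttft_s: float, output_tps: float) -> float:
    """Rough end-to-end seconds: time to first token plus steady decoding."""
    return ttft_s + output_tokens / output_tps


if __name__ == "__main__":
    print(f"blended / 1M: ${blended_price(INPUT_PER_M, OUTPUT_PER_M):.3f}")
    print(f"3,000 in + 1,000 out: ${request_cost(3000, 1000, INPUT_PER_M, OUTPUT_PER_M):.6f}")
    print(f"500 output tokens: {estimated_latency(500, TTFT_S, OUTPUT_TPS):.1f} s")
```

Under these assumptions, a 500-token response would take a little over ten seconds end to end, most of it decoding time rather than TTFT.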

Model Intelligence

Arena ranking: 1
Benchmarkable: No
Model level: family
Intelligence Index: 3.4
Coding Index: 5.8
Math Index: 18.3
MMLU Pro: 0.6
GPQA: 0.35
HLE: 0.05
LiveCodeBench: 0.14
SciCode: 0.17
MATH-500: 0.85
AIME: 0.22
AIME 2025: 0.18
IFBench: 0.37
LCR: 0.07
TerminalBench Hard: 0.01
TAU2: 0.11

Recent stories

12 linked stories
release · SECONDARY · 2026-06-28
DeepSeek releases DSpark checkpoints for Qwen3 and Gemma-4

DeepSeek extended DSpark beyond V4 by publishing draft-model checkpoints for Qwen3 and Gemma-4 families and clarifying that DSpark targets higher-throughput serving by controlling verification cost. The release matters because speculative decoding is moving from papers into reusable open checkpoints.

release · PRIMARY · 2026-06-05
Google releases Gemma 4 QAT: E2B drops to ~1GB and Ollama, SGLang, vLLM add support

Google published Gemma 4 QAT checkpoints and mobile-focused quant formats, cutting Gemma 4 E2B to roughly 1GB of memory. Ollama, SGLang, and vLLM added day-one support, making local deployment more practical on phones, laptops, and low-VRAM GPUs.

release · PRIMARY · 2026-06-03
Gemma 4 12B ships encoder-free multimodal local model with 16GB target and 256K context

Google released Gemma 4 12B, an Apache 2.0 encoder-free multimodal model with native audio and vision for 16GB-class laptops. Day-zero support in llama.cpp, vLLM, Ollama, MLX, and SGLang should make local agents and on-device apps easier to deploy immediately.

release · SECONDARY · 2026-05-15
Together AI launches Gemma-4-31B-it-Pearl endpoint with 25%+ discounted pricing

Together AI launched Gemma-4-31B-it-Pearl as a serverless endpoint that uses Pearl's proof-of-useful-work emissions to offset inference cost. It matters because the pricing model ties serving economics to compute-side byproducts instead of token billing alone.

release · SECONDARY · 2026-05-12
agent-tui 0.2.0 adds markdown rendering, tool approvals, and local Gemma 4 support

agent-tui shipped v0.2.0 with markdown rendering, tool approvals, configurable reasoning views, and an AI SDK-only dependency chain. The demo also showed Gemma 4 31B running locally, so the terminal UI now covers hosted and on-device models.

news · PRIMARY · 2026-05-10
Local users report DeepSeek V4 Flash, Qwen 3.6, and Gemma 4 at 40-200 tok/s on Macs and 3090s

Developers posted new local-model measurements for DeepSeek V4 Flash, Qwen 3.6, and Gemma 4: about 40 tok/s on an M3 Ultra, 70+ tok/s on MacBooks with MPS, and 120-200 tok/s for Qwen3.6-27B on a single RTX 3090. The numbers suggest coding-capable local runs are moving from demos toward regular use.

release · PRIMARY · 2026-05-05
Gemma 4 adds MTP drafters for up to 3x faster decoding

Google released Multi-Token Prediction drafters for Gemma 4 and says decoding can run up to 3x faster without output-quality loss. vLLM and SGLang support shipped day one, so local and server deployments can try the speedup immediately.

release · SECONDARY · 2026-04-23
Google DeepMind releases Decoupled DiLoCo with 12B Gemma training across 4 US regions

Google DeepMind introduced Decoupled DiLoCo, a distributed-training method that trained a 12B Gemma model across four US regions and mixed TPU6e/v5p hardware while tolerating failures. It matters because it targets the networking and uptime bottlenecks that make frontier training geographically rigid and operationally fragile.

news · PRIMARY · 2026-04-19
Gemma 4 ecosystem ships 60+ on-device demos and local agent benchmarks

A weekend of Gemma 4 demos spanned YC hackathon projects, offline iPhone runs, and HN reports of strong local coding and SQL-agent performance. Gemma 4 is increasingly showing up as a practical edge model for tool use and multimodal apps, not just a release benchmark.

news · PRIMARY · 2026-04-06
Gemma 4 26B-A4B benchmarks at ~40 tokens/s in Mac code-agent tests

HN practitioners report Gemma 4 26B-A4B running at nearly 40 tokens per second in code-agent harnesses on Mac-class hardware, and Unsloth published a free Colab fine-tuning flow. The local benchmark is a practical reference point, and the Colab path offers task-specific tuning at no added cost.

workflow · PRIMARY · 2026-04-04
Gemma 4 26B-A4B runs at 30K context on 16 GB VRAM in community configs

Users published reproducible 16 GB VRAM and Apple Silicon setups for the Gemma 4 26B-A4B and 31B variants. Google’s AI Gallery app also brought offline Gemma chat to phones. The setups make local coding and vision work more practical, but runtime choice, quantization, and recent llama.cpp regressions still affect reliability.

release · PRIMARY · 2026-04-02
Gemma 4 ships 31B Dense and 26B MoE open models under Apache 2.0

Google DeepMind released Gemma 4 in four open models with up to 256K context, multimodal inputs, and native tool-calling for local agent workflows. Day-0 support across serving stacks and benchmark wins make it ready for phones, laptops, and server GPUs.

© 2026 AI Primer. All rights reserved.