Gemma
Stories, products, and related signals connected to this tag in Explore.
Stories
Filter storiesNew reports show Qwen3.6 and Gemma 4 running locally across Apple and Nvidia setups, with wide variance tied to context length, runtime choice, and MTP tuning. This matters because the latest open models are reaching usable agent speeds on consumer hardware, but prefill and long-context performance still cap throughput.
Together AI launched Gemma-4-31B-it-Pearl as a serverless endpoint that uses Pearl's proof-of-useful-work emissions to offset inference cost. It matters because the pricing model ties serving economics to compute-side byproducts instead of token billing alone.
agent-tui shipped v0.2.0 with markdown rendering, tool approvals, configurable reasoning views, and an AI SDK-only dependency chain. The demo also showed Gemma 4 31B running locally, so the terminal UI now covers hosted and on-device models.
Developers posted new local-model measurements for DS4, Qwen 3.6, and Gemma 4: about 40 tok/s on an M3 Ultra, 70+ tok/s on MacBooks with MPS, and 120-200 tok/s for Qwen3.6-27B on a single RTX 3090. The numbers suggest coding-capable local runs are moving from demos toward regular use.
Community ports brought Gemma 4 multi-token prediction into llama.cpp and MLX Swift, with one M5 Max report moving from 97 to 138 tok/s and another showing 30-40% faster decoding. The gains extend MTP into local runtimes used for on-device coding and long-context work.
Google released Multi-Token Prediction drafters for Gemma 4 and says decoding can run up to 3x faster without output-quality loss. vLLM and SGLang support shipped day one, so local and server deployments can try the speedup immediately.
Google DeepMind introduced Decoupled DiLoCo, a distributed-training method that trained a 12B Gemma model across four US regions and mixed TPU6e/v5p hardware while tolerating failures. It matters because it targets the networking and uptime bottlenecks that make frontier training geographically rigid and operationally fragile.
A weekend of Gemma 4 demos spanned YC hackathon projects, offline iPhone runs, and HN reports of strong local coding and SQL-agent performance. Gemma 4 is increasingly showing up as a practical edge model for tool use and multimodal apps, not just a release benchmark.
HN practitioners report Gemma 4 26B-A4B near 40 tokens per second in code-agent harnesses on Mac-class hardware, and Unsloth published a free Colab fine-tuning flow. Use the local benchmark as a practical reference and the Colab path if you want task-specific tuning without added cost.
Users published reproducible 16 GB VRAM and Apple Silicon setups for the Gemma 4 26B-A4B and 31B variants. Google’s AI Gallery app also brought offline Gemma chat to phones. The setups make local coding and vision work more practical, but runtime choice, quantization, and recent llama.cpp regressions still affect reliability.
Google DeepMind released Gemma 4 in four open models with up to 256K context, multimodal inputs, and native tool-calling for local agent workflows. Day-0 support across serving stacks and benchmark wins make it ready for phones, laptops, and server GPUs.
A Google bot-authored LiteRT-LM pull request references Gemma4 and AIcore NPU support, while multiple posts claim a largest version around 120B total and 15B active parameters. Engineers targeting on-device inference should wait for a formal model card before locking plans.