Fresh practitioner evals used workload harnesses instead of leaderboards: one RTX 4090 test scored Gemma 4 26B over Qwen 3.5 27B by 13-5, and a 9-model calorie test put Sonnet 4.6 first. A finance-routing run reported roughly 60% cost savings, which means model choice now has to track latency and per-task error, not just leaderboard rank.

You can read Google's Gemma 4 model page, inspect the AdaptLLM finance-tasks dataset, and compare the live practitioner chatter in the main HN thread. The weirdly useful part of all three evals is that they are not trying to crown a universal winner. They are measuring drift, latency, and cost on narrow jobs that actually show up in products.
The cleanest result in the evidence pool is also the messiest in methodology. One Reddit user ran 18 repeated business-task tests on a local box with an RTX 4090, i9-14900KF, 64 GB RAM, Ubuntu 25.10, and Ollama, then gave Gemma a 13 to 5 win over Qwen.
According to the benchmark writeup, Gemma kept winning tasks like summaries, objections, hooks, story ads, campaign rounds, and technical blueprint generation, while Qwen's wins clustered around broader synthesis, emotional framing, and JSON compilation. A top reply in the same thread argued the comparison may even flatter Qwen, because Gemma 4's mixture-of-experts setup uses about 4B active parameters against Qwen's dense 27B.
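A harness like this boils down to head-to-head judgments tallied per task. A minimal sketch of that scoring, where the category names echo the post but the winner assignments and the number of tasks per category are invented for illustration:

```python
from collections import Counter

# Illustrative head-to-head results keyed by task category; the categories
# echo the post (summaries, objections, hooks, ...) but the winners and the
# task count are made up to show the scoring, not to reproduce the 13-5 run.
results = {
    "summary": "gemma", "objections": "gemma", "hooks": "gemma",
    "story_ad": "gemma", "campaign_round": "gemma", "blueprint": "gemma",
    "broad_synthesis": "qwen", "emotional_framing": "qwen", "json_compile": "qwen",
}

# Tally wins per model across all tasks.
score = Counter(results.values())
print(f"gemma {score['gemma']} - qwen {score['qwen']}")  # → gemma 6 - qwen 3
```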
The nine-model calorie benchmark is narrower, but better instrumented. The author says the harness used the same production system prompt, the same JSON schema, multiple runs per case, and median end-to-end latency versus mean calorie error.
The attached chart in the LLMDevs post puts Sonnet 4.6 in the bottom-left accuracy slot at about 1.7 percent error, with Opus 4.6 close behind but slower. GPT-5.4 Nano and Mini sit on the fast edge, around 1.5 to 2.5 seconds, but with higher error, while Gemini 3.1 Pro lands as the slowest model at roughly 7.1 seconds without an accuracy payoff.
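The aggregation the harness description implies is easy to sketch: per model, take the median end-to-end latency (robust to one slow run) and the mean absolute calorie error (keeps occasional large misses visible). The model names and per-run numbers below are invented to show the arithmetic, not the post's raw data:

```python
from statistics import mean, median

# Illustrative per-run records: (model, latency_seconds, abs_percent_error).
# Values are made up for demonstration; only the aggregation mirrors the post.
runs = [
    ("sonnet-4.6", 3.2, 1.9), ("sonnet-4.6", 3.0, 1.5), ("sonnet-4.6", 3.4, 1.7),
    ("gpt-5.4-nano", 1.6, 4.1), ("gpt-5.4-nano", 1.4, 3.8), ("gpt-5.4-nano", 1.5, 4.4),
]

def aggregate(records):
    by_model = {}
    for model, latency, err in records:
        lats, errs = by_model.setdefault(model, ([], []))
        lats.append(latency)
        errs.append(err)
    # Median latency vs mean error, as described in the benchmark writeup.
    return {m: (median(lats), mean(errs)) for m, (lats, errs) in by_model.items()}

print(aggregate(runs))
```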
The routing test is the most product-shaped of the three because it measures orchestration rather than model quality in isolation. The author used public tasks from AdaptLLM's finance-tasks, compared an all-Opus baseline with two routing strategies, and reported about 60 percent blended savings.
The task-by-task breakdown from the Reddit post is more interesting than the average: FiQA sentiment showed 78 percent savings with intra-provider routing and 89 percent with open-source medium-tier models, while ConvFinQA still saved 58 percent in the intra-provider setup. The explanation was simple: many questions inside a long 10-K are still table lookups, so the surrounding document can be complex without every individual prompt needing Opus.
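The blending arithmetic behind a headline number like this is just total routed cost over total baseline cost. A sketch with hypothetical per-task costs (arbitrary units, chosen so the per-task savings echo the intra-provider figures; nothing here reproduces the author's actual spend):

```python
# Hypothetical per-task costs for an all-Opus baseline versus a routed setup.
# Task names follow the post; the numbers are invented so that the per-task
# savings come out at 78% and 58%, matching the intra-provider figures.
baseline = {"fiqa_sentiment": 100.0, "convfinqa": 100.0}
routed   = {"fiqa_sentiment": 22.0,  "convfinqa": 42.0}

def blended_savings(base, routed_costs):
    # Blended savings = 1 - (total routed spend / total baseline spend).
    total_base = sum(base.values())
    total_routed = sum(routed_costs[t] for t in base)
    return 1.0 - total_routed / total_base

print(f"{blended_savings(baseline, routed):.0%}")
```

Note that the blended number is weighted by baseline spend per task, so a cheap-to-route task with heavy volume can dominate the headline percentage.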
Fresh discussion on Google releases Gemma 4 open models
1.8k upvotes · 473 comments
The HN thread around Gemma 4 has drifted away from launch-day benchmark talk and toward deployment texture. Fresh comments summarized in the newer discussion snapshot describe the 26B-A4B variant holding up in a Claude Code-style harness with roughly 30K to 37K tokens on an M1 Max, while the smaller E2B and E4B variants were described as borderline for tasks like jq generation and in-browser inference.
That lines up neatly with the Reddit evidence. Earlier HN discussion already highlighted code-agent throughput and the difference between raw token speed and usable latency, and the fresh update adds one more concrete detail: a commenter said they had already updated a local voice app to run on a MacBook M3 Pro. The story across all three evals is becoming very specific: which model survives the harness you actually have, on the hardware you actually own.