Fresh practitioner evals used workload harnesses instead of leaderboards: one RTX 4090 test scored Gemma 4 26B over Qwen 3.5 27B by 13-5, and a 9-model calorie test put Sonnet 4.6 first. A finance-routing run reported roughly 60% cost savings, which means model choice now has to track latency and per-task error, not just leaderboard rank.

You can read Google's Gemma 4 model page, inspect the AdaptLLM finance-tasks dataset, and compare the live practitioner chatter in the main HN thread. The weirdly useful part of all three evals is that they are not trying to crown a universal winner. They are measuring drift, latency, and cost on narrow jobs that actually show up in products.
The cleanest result in the evidence pool is also the messiest in methodology. One Reddit user ran 18 repeated business-task tests on a local box with an RTX 4090, i9-14900KF, 64 GB RAM, Ubuntu 25.10, and Ollama, then gave Gemma a 13 to 5 win over Qwen.
According to the benchmark writeup, Gemma kept winning tasks like summaries, objections, hooks, story ads, campaign rounds, and technical blueprint generation, while Qwen's wins clustered around broader synthesis, emotional framing, and JSON compilation. A top reply in the same thread argued the comparison may even flatter Qwen, because Gemma 4's mixture-of-experts setup uses about 4B active parameters against Qwen's dense 27B.
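A harness like this boils down to head-to-head judgments tallied per task. A minimal sketch of that scoring, where the category names echo the post but the winner assignments and the number of tasks per category are invented for illustration:

```python
from collections import Counter

# Illustrative head-to-head results keyed by task category; the categories
# echo the post (summaries, objections, hooks, ...) but the winners and the
# task count are made up to show the scoring, not to reproduce the 13-5 run.
results = {
    "summary": "gemma", "objections": "gemma", "hooks": "gemma",
    "story_ad": "gemma", "campaign_round": "gemma", "blueprint": "gemma",
    "broad_synthesis": "qwen", "emotional_framing": "qwen", "json_compile": "qwen",
}

# Tally wins per model across all tasks.
score = Counter(results.values())
print(f"gemma {score['gemma']} - qwen {score['qwen']}")  # → gemma 6 - qwen 3
```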
The nine-model calorie benchmark is narrower, but better instrumented. The author says the harness used the same production system prompt, the same JSON schema, multiple runs per case, and median end-to-end latency versus mean calorie error.
The attached chart in the LLMDevs post puts Sonnet 4.6 in the bottom-left accuracy slot at about 1.7 percent error, with Opus 4.6 close behind but slower. GPT-5.4 Nano and Mini sit on the fast edge, around 1.5 to 2.5 seconds, but with higher error, while Gemini 3.1 Pro lands as the slowest model at roughly 7.1 seconds without an accuracy payoff.
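The aggregation the harness description implies is easy to sketch: per model, take the median end-to-end latency (robust to one slow run) and the mean absolute calorie error (keeps occasional large misses visible). The model names and per-run numbers below are invented to show the arithmetic, not the post's raw data:

```python
from statistics import mean, median

# Illustrative per-run records: (model, latency_seconds, abs_percent_error).
# Values are made up for demonstration; only the aggregation mirrors the post.
runs = [
    ("sonnet-4.6", 3.2, 1.9), ("sonnet-4.6", 3.0, 1.5), ("sonnet-4.6", 3.4, 1.7),
    ("gpt-5.4-nano", 1.6, 4.1), ("gpt-5.4-nano", 1.4, 3.8), ("gpt-5.4-nano", 1.5, 4.4),
]

def aggregate(records):
    by_model = {}
    for model, latency, err in records:
        lats, errs = by_model.setdefault(model, ([], []))
        lats.append(latency)
        errs.append(err)
    # Median latency vs mean error, as described in the benchmark writeup.
    return {m: (median(lats), mean(errs)) for m, (lats, errs) in by_model.items()}

print(aggregate(runs))
```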
The routing test is the most product-shaped of the three because it measures orchestration rather than model quality in isolation. The author used public tasks from AdaptLLM's finance-tasks, compared an all-Opus baseline with two routing strategies, and reported about 60 percent blended savings.
The task-by-task breakdown from the Reddit post is more interesting than the average: FiQA sentiment showed 78 percent savings with intra-provider routing and 89 percent with open-source medium-tier models, while ConvFinQA still saved 58 percent in the intra-provider setup. The explanation was simple: many questions inside a long 10-K are still table lookups, so the surrounding document can be complex without every individual prompt needing Opus.
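The blending arithmetic behind a headline number like this is just total routed cost over total baseline cost. A sketch with hypothetical per-task costs (arbitrary units, chosen so the per-task savings echo the intra-provider figures; nothing here reproduces the author's actual spend):

```python
# Hypothetical per-task costs for an all-Opus baseline versus a routed setup.
# Task names follow the post; the numbers are invented so that the per-task
# savings come out at 78% and 58%, matching the intra-provider figures.
baseline = {"fiqa_sentiment": 100.0, "convfinqa": 100.0}
routed   = {"fiqa_sentiment": 22.0,  "convfinqa": 42.0}

def blended_savings(base, routed_costs):
    # Blended savings = 1 - (total routed spend / total baseline spend).
    total_base = sum(base.values())
    total_routed = sum(routed_costs[t] for t in base)
    return 1.0 - total_routed / total_base

print(f"{blended_savings(baseline, routed):.0%}")
```

Note that the blended number is weighted by baseline spend per task, so a cheap-to-route task with heavy volume can dominate the headline percentage.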
Fresh discussion on Google releases Gemma 4 open models
1.8k upvotes · 473 comments
The HN thread around Gemma 4 has drifted away from launch-day benchmark talk and toward deployment texture. Fresh comments summarized in the newer discussion snapshot describe the 26B-A4B variant holding up in a Claude Code-style harness with roughly 30K to 37K tokens on an M1 Max, while the smaller E2B and E4B variants were described as borderline for tasks like jq generation and in-browser inference.
That lines up neatly with the Reddit evidence. Earlier HN discussion already highlighted code-agent throughput and the difference between raw token speed and usable latency, and the fresh update adds one more concrete detail: a commenter said they had already updated a local voice app to run on a MacBook M3 Pro. The story across all three evals is becoming very specific: which model survives the harness you actually have, on the hardware you actually own.