One harness put Sonnet 4.6 near 1.7% calorie error, while a small Rust model claimed 50x faster inference from hybrid attention. Use the rankings carefully: separate posts said code-assistant scores still miss codebase context and per-feature cost.

The meal benchmark came with a simple latency-versus-error chart. The hybrid-attention post bundled the speedup claim with a repo link and a proof-of-concept disclaimer. A LocalLLaMA workflow post described a planner-in-the-cloud, coder-on-Ollama split that claimed 85% token savings. And the code-assistant thread had the cleanest line in the bunch: one tool generated a tutorial endpoint, not an endpoint for their codebase.
The most useful part of AndreiOnyx's post is the setup discipline. Same system prompt, same structured JSON output, same direct provider API calls, then median latency against mean absolute calorie error.
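That discipline reduces to a small loop. A minimal sketch, where `query_model` is a hypothetical stand-in for the direct provider API call and the prompt text is illustrative, not the author's actual prompt:

```python
import json
import statistics
import time

SYSTEM_PROMPT = 'Estimate total calories for the meal. Reply with JSON: {"calories": <number>}.'

def query_model(system_prompt: str, meal_description: str) -> str:
    # Hypothetical stand-in for a direct provider API call.
    # Every model gets the identical system prompt and JSON schema.
    return json.dumps({"calories": 650})

def run_harness(meals):
    """meals: list of (description, label_calories) pairs."""
    latencies, abs_errors = [], []
    for description, label in meals:
        start = time.perf_counter()
        raw = query_model(SYSTEM_PROMPT, description)
        latencies.append(time.perf_counter() - start)
        predicted = json.loads(raw)["calories"]
        abs_errors.append(abs(predicted - label))
    # Median latency vs. mean absolute calorie error, as in the post.
    return statistics.median(latencies), statistics.mean(abs_errors)
```

Plotting each model's median latency against its mean absolute error gives exactly the chart's two axes.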
The chart split the field into three recognizable clusters.
The author also gave the caveats engineers actually care about: a tiny test set, a single client location, calorie-only scoring, and nutrition labels that already carry their own noise.
The Rust model post makes two claims that fit together better than the headline suggests. The architectural tweak replaced standard attention with a hybrid of local windowed attention and a GRU-like recurrent state, gated together inside each layer.
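The mechanism can be sketched per layer. A toy, per-token scalar version, assuming fixed `window`, `alpha`, and `gate` values; the real layer operates on vectors with learned projections and a learned gate:

```python
import math

def hybrid_layer(xs, window=4, alpha=0.9, gate=0.5):
    """Toy sketch of the described layer: local windowed attention
    blended with a GRU-like recurrent state via a gate.
    All constants here are illustrative assumptions."""
    outs, state = [], 0.0
    for t, x in enumerate(xs):
        # Local windowed attention: softmax over the last `window` tokens only.
        ctx = xs[max(0, t - window + 1): t + 1]
        weights = [math.exp(c * x) for c in ctx]
        attn = sum(w * c for w, c in zip(weights, ctx)) / sum(weights)
        # GRU-like recurrent state: a leaky running summary of the whole stream.
        state = alpha * state + (1 - alpha) * x
        # The gate blends the two paths inside the layer.
        outs.append(gate * attn + (1 - gate) * state)
    return outs
```

The point of the split is asymptotic: the attention path costs O(window) per token instead of O(t), while the recurrent state carries long-range context in constant space.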
The speed claim was dramatic: with a KV cache that kept recent tokens in VRAM and compressed older ones, inference allegedly moved from 5.6 tokens per second to 286 on a 4060 Ti. The quality claim was more restrained. The author said generation quality did not clearly improve, while the bigger gain in validation loss came from expanding the training corpus from roughly 31 MB of core Rust sources to 173 MB with added crates.
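The cache policy is easy to state in code. A sketch assuming block mean-pooling as the compression step, which is an illustrative choice; the post says older entries are compressed but does not specify how:

```python
class TieredKVCache:
    """Sketch of the described cache policy: keep the most recent
    key/value entries in full, fold older entries into compressed
    summaries. Mean-pooling per evicted block is an assumption."""

    def __init__(self, recent_capacity=8, block=4):
        self.recent_capacity = recent_capacity
        self.block = block
        self.recent = []      # full-precision recent entries (in VRAM)
        self.compressed = []  # one pooled entry per evicted block

    def append(self, kv: float):
        self.recent.append(kv)
        if len(self.recent) > self.recent_capacity:
            # Evict the oldest block as a single mean-pooled entry.
            old = self.recent[:self.block]
            self.recent = self.recent[self.block:]
            self.compressed.append(sum(old) / len(old))

    def __len__(self):
        return len(self.recent) + len(self.compressed)
```

Memory per step stays bounded by `recent_capacity` plus one slot per evicted block, which is where the throughput headroom on a 4060 Ti would come from.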
That leaves a tidy split between inference engineering and model quality. The same post presents hybrid attention as the fast path, not the main reason the final checkpoint hit 0.82 validation loss and 2.15 perplexity.
One LLMDevs audit is a good reminder that benchmark winners can still be the wrong unit of analysis. The author broke their spend down by feature and found three plain failures.
In a separate LocalLLaMA workflow, another developer described the same instinct in coding form: use Claude for planning, hand the file edits to local Qwen models via Ollama, then validate and auto-fix in loops. The post claimed about 85% token savings on a TypeScript project with 12 files changed.
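The control flow behind that split is a single loop. A hedged sketch in which every function name is a hypothetical stand-in: `plan_with_cloud` for the Claude call, `edit_with_local` for Qwen via Ollama, and `validate` for the type-check or test run:

```python
def plan_with_cloud(task: str) -> str:
    # Hypothetical stand-in for the one expensive cloud planning call.
    return f"plan: {task}"

def edit_with_local(plan: str, attempt: int) -> str:
    # Hypothetical stand-in for a local Qwen-via-Ollama edit call.
    return f"edit-{attempt} for {plan}"

def validate(edit: str) -> bool:
    # Stand-in for compilation / tests; a real loop would shell out here.
    return edit.startswith("edit-2")

def run_workflow(task: str, max_fixes: int = 3) -> str:
    """One cloud planning call, then cheap local edit/auto-fix loops.
    The claimed token savings come from never re-sending file contents
    to the cloud model once the plan exists."""
    plan = plan_with_cloud(task)
    for attempt in range(1, max_fixes + 1):
        edit = edit_with_local(plan, attempt)
        if validate(edit):
            return edit
    raise RuntimeError("auto-fix loop exhausted")
```

The design choice worth noting: the expensive model sees the task once, while every retry burns only local tokens, which is how a multi-file TypeScript change could plausibly cut cloud usage by the claimed amount.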
The strongest critique of comparison culture was not about price or latency. It was about whether a tool can follow the local rules of a real repository.
Their production test asked three tools to add an endpoint to an existing service. The market leader produced code that compiled but picked the wrong authentication middleware, error handling pattern, response envelope, and logging format. The tool with stronger repository context used the team’s existing stack and needed only minor edits. The open source self-hosted option did not complete the task meaningfully.
One commenter in the same thread pushed on the missing tool names. Another answered with the more interesting point: realistic enterprise-context benchmarks are rare because real codebases expose proprietary architecture. That is a big reason why today’s coding-assistant scoreboards still read like generic model tests with an IDE attached.