MiniMax M2.7 benchmarks 34% hallucination rate on new tests
New third-party tests put MiniMax M2.7 at a 34% hallucination rate, roughly 65 tokens per second, and 27.04% on Vibe Code Bench, while users pushed it through physics-heavy web demos. It looks increasingly viable for agent workflows, but performance still swings by task and harness.

TL;DR
- Third-party testing now puts MiniMax M2.7 at a much lower hallucination rate than M2.5: the AA-Omniscience chart shows 34% for M2.7 versus 89% for M2.5, and practitioner testing from Cedric Chee likewise reports fewer hallucinations on a voxel pagoda prompt.
- On coding and agent-style evals, the benchmark roundup reports 56.22% on SWE-Pro, 1495 Elo on GDPval-AA, and a 97% adherence rate on “massive, complex skills,” while a separate Vibe Code Bench run places M2.7 at 27.04% for building apps from scratch.
- Speed looks roughly flat despite the upgrade: the Zhihu review thread says MiniMax kept average throughput around 65 tokens per second even under “tight computing power pressure,” but the same review says complex reasoning regressed slightly and can consume 50%-100% more tokens.
- M2.7 is already shipping in product surfaces and agent stacks: launch coverage says it is live in MiniMax Agent and the API, and Hermes Agent support confirms day-one availability through the MiniMax provider.
What moved in third-party evals?
The cleanest change is hallucination handling. In the AA-Omniscience chart, M2.7 drops to 34%, down from 89% for M2.5, which puts it ahead of GPT-5.4 on that specific benchmark. Cedric Chee's voxel pagoda test points in the same direction: he says M2.5 sometimes hallucinated the surrounding garden scene, while M2.7 does so less often.
The coding picture is more mixed but still stronger than the last release. The benchmark roundup claims 56.22% on SWE-Pro, matching GPT-5.3 Codex, plus gains on Terminal Bench 2, VIBE-Pro, Toolathlon, and GDPval-AA. A separate Vibe Code Bench leaderboard is much harsher, placing M2.7 at 27.04% ± 4.18 for end-to-end app generation, but it also tags the run at $2.82 per test and 1,377 seconds of latency, a materially cheaper profile than the top GPT and Claude entries in that table.
Where does M2.7 still break down?
The best independent review here is less bullish than the launch chatter. According to the Zhihu review thread, M2.7 improved “direct/indirect instruction execution” and context hallucination, but stability is still uneven: it can score full marks on long code derivation, then fall to “unusable” on medium-complexity tasks because of misread instructions or repeated fixes. The same review says there is “no substantial upgrade” in high-level engineering design, even if the model now more often writes SPEC.md and README.md to track project logic.
Cedric Chee's thread summary makes a similar distinction. He calls out better “real-world engineering” and “professional office delivery,” but frames M2.7 as an early step in model self-evolution rather than a broad jump in general intelligence. The Zhihu review thread is explicit that hard reasoning regressed slightly, with 50%-100% higher token use from excessive enumeration and more max-token failures on complex tasks.
How are engineers trying it in agent stacks?
MiniMax shipped M2.7 directly into production surfaces instead of keeping it as a paper release. The launch coverage says it is available immediately in MiniMax Agent and via the API, and MiniMax's release post is the main product reference. Teknium's note on Hermes Agent support adds that it landed in Hermes Agent through the MiniMax provider on day one.
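For teams wiring it into their own stacks, the API path is the lower-friction way to kick the tires. Here is a minimal sketch of a call through the OpenAI TypeScript SDK, assuming MiniMax exposes an OpenAI-compatible chat completions endpoint; the base URL and model id below are placeholders rather than confirmed values, so check MiniMax's API docs before running it.

```typescript
// Minimal sketch, assuming an OpenAI-compatible endpoint.
// Base URL and model id are placeholders, not confirmed values.
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.MINIMAX_API_KEY,
  baseURL: "https://api.minimax.example/v1", // placeholder: use the URL from MiniMax's docs
});

async function main() {
  const completion = await client.chat.completions.create({
    model: "MiniMax-M2.7", // placeholder model id
    messages: [
      { role: "system", content: "You are a coding agent." },
      { role: "user", content: "Write a single-file HTML page with a draggable cloth simulation." },
    ],
  });
  console.log(completion.choices[0]?.message?.content);
}

main().catch(console.error);
```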
The practical usage pattern emerging around M2.7 is agent orchestration plus browser-native coding. A MaxClaw session slide describes a “100,000+ scalable” agent system with components including an LLM gateway, MCP server, and MicroVM sandboxing. On the application side, user demos show M2.7 generating self-contained HTML and physics-heavy web experiences: one receipt physics demo builds a draggable cloth-style receipt simulation in a single HTML file, and another water physics demo extends that to a temperature-controlled water simulation. Those demos are anecdotal, but they fit the narrower claim in the benchmark roundup that M2.7 is strongest in multi-round edits and agentic delivery rather than pure reasoning peaks.
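For a sense of what the receipt demo involves, draggable cloth simulations of that kind usually come down to a small verlet-integration loop plus a few constraint-relaxation passes over the links between points. The sketch below shows that generic technique in TypeScript; it is not the demo's actual code, and every name in it is illustrative.

```typescript
// Generic verlet-cloth core: positions carry their own velocity implicitly
// (current minus previous position), and links are relaxed toward rest length.
type Point = { x: number; y: number; px: number; py: number; pinned: boolean };
type Link = { a: Point; b: Point; rest: number };

const GRAVITY = 980;   // px/s^2, pulls unpinned points downward
const ITERATIONS = 5;  // constraint relaxation passes per frame

function verletStep(points: Point[], dt: number): void {
  for (const p of points) {
    if (p.pinned) continue;
    const vx = p.x - p.px; // implicit velocity from the previous frame
    const vy = p.y - p.py;
    p.px = p.x;
    p.py = p.y;
    p.x += vx;
    p.y += vy + GRAVITY * dt * dt;
  }
}

function satisfyLinks(links: Link[]): void {
  for (let i = 0; i < ITERATIONS; i++) {
    for (const { a, b, rest } of links) {
      const dx = b.x - a.x;
      const dy = b.y - a.y;
      const dist = Math.hypot(dx, dy) || 1e-6;
      const diff = (dist - rest) / dist / 2; // split the correction between both ends
      if (!a.pinned) { a.x += dx * diff; a.y += dy * diff; }
      if (!b.pinned) { b.x -= dx * diff; b.y -= dy * diff; }
    }
  }
}
```

Rendering and pointer dragging sit on top of this loop, but these two functions are the part that makes a single-file HTML receipt behave like cloth.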