AI Primer

MiniMax M3 users report slow runs and broken code after launch

A day after MiniMax M3 launched, independent testers posted mixed results: cheap demos and design tasks worked, but several coding runs stalled, broke features, or used more tokens than expected. New external numbers added nuance: Context Arena scores fell sharply beyond 64K context, and one DeepSWE run passed 15 of 113 tasks.

TL;DR

MiniMax's own launch post, model page, pricing table, and tool-use docs are worth reading together, because the upbeat benchmark framing leaves the practical caveats scattered across product pages. Three caveats surfaced quickly. The API guarantees a minimum of 512K context rather than universal 1M availability. Prompts above 512K are still limited-access on pay-as-you-go. And the tool docs explicitly describe reasoning between tool calls, which lines up with the token-heavy coding traces users started posting.

Launch claims and pricing

MiniMax introduced M3 as a frontier coding and agent model with native image and video input, desktop control, and a 1M-token context window in the official launch post. The matching product page is more specific: the API supports up to 1M tokens, with a guaranteed minimum of 512K.

The pricing story is similarly split across pages. MiniMax's pay-as-you-go docs list a 7-day, 50% launch discount, with prices of $0.30 per million input tokens and $1.20 per million output tokens for prompts up to 512K. Prompts above 512K are priced higher and marked limited-access for now.
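As a rough illustration of what those rates imply, the sketch below estimates the cost of a single request. The function name `estimate_cost_usd` and the reading of "512K" as 512,000 tokens are assumptions made for illustration. The higher tier above 512K is left out because no price for it is quoted here.

```python
# Rates quoted in the pay-as-you-go docs for prompts up to 512K tokens.
INPUT_USD_PER_MILLION = 0.30
OUTPUT_USD_PER_MILLION = 1.20

# Assumption: "512K" is read as 512,000 tokens; the docs may mean 524,288.
STANDARD_TIER_MAX_PROMPT_TOKENS = 512_000


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost in USD at the standard-tier rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if input_tokens > STANDARD_TIER_MAX_PROMPT_TOKENS:
        # The higher tier's price is not quoted here, so refuse to guess.
        raise ValueError("prompt exceeds the standard tier; price not stated")
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # A hypothetical coding turn: 100K tokens in, 10K tokens out.
    print(f"${estimate_cost_usd(100_000, 10_000):.3f}")
```

At these rates an output token costs four times as much as an input token, so reasoning-heavy traces weigh more on the bill than long prompts do.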

The benchmark sheet MiniMax pushed on day one centered on coding and agentic tasks. WesRoth's roundup listed 59.0% on SWE-Bench Pro, 66.0% on Terminal Bench 2.1, 34.8% on SWE-fficiency, 28.8% on KernelBench Hard, and 74.2% on MCP Atlas.

Real coding runs

The cleanest early counterweight came from bridgemindai's production-code review, which ran M3 on a real codebase and got broken push-to-talk, a glitched game scene, and a rough video result that still needed two tries. The same post said the full BridgeBench run cost just $4.09, so the low-cost claim held up better than the quality claim.

Before that fuller review landed, bridgemindai's test announcement had already framed the gap directly: M3 looked strong on paper, but the live check was against real TODOs in a production repo rather than benchmark tasks.

A second external datapoint came from DeepSWE. AiBattle_ noted that M2.7 had scored 0% there, and the DeepSWE repost by kimmonismus said M3 passed 15 of 113 tasks, or 19 if over-time runs are counted. That is a large step up from M2.7, but it is nowhere near the frontier impression created by the day-one benchmark graphics.
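For scale, the DeepSWE counts convert to pass rates as follows. This is a minimal sketch; `pass_rate` is a hypothetical helper, and only the 15, 19, and 113 figures come from the reposted result.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the fraction of tasks passed, validating the counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return passed / total


TOTAL_TASKS = 113

strict = pass_rate(15, TOTAL_TASKS)   # runs that finished within the time limit
lenient = pass_rate(19, TOTAL_TASKS)  # also counting over-time runs

print(f"strict: {strict:.1%}")               # strict: 13.3%
print(f"with over-time runs: {lenient:.1%}")  # with over-time runs: 16.8%
```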

MiniMax also reposted an early Chinese hands-on claim that M3 felt close to Opus 4.7 for coding in one morning of use, with one PR already completed, via MiniMax_AI's repost. The split between that optimism and the harsher external runs is the whole story right now.

Token burn and context limits

The most concrete independent write-up on efficiency came from ZhihuFrontier's evaluation summary, which said M3's token consumption was 77% higher than M2.7 and that medium-complexity tasks often reached 60K to 70K tokens. The same write-up credited M3 for stronger reasoning, architecture choices, and self-testing, while saying long instructions and long conversations could still cause requirement drift.
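To put the 77% figure in perspective, the sketch below back-computes what the same tasks would have consumed at M2.7's rate. It assumes the 77% increase applies uniformly to the 60K to 70K medium-complexity range, which the write-up does not state. `implied_baseline_tokens` is a hypothetical helper.

```python
def implied_baseline_tokens(new_tokens: float, relative_increase: float) -> float:
    """Invert a relative increase: new = baseline * (1 + relative_increase)."""
    if relative_increase <= -1:
        raise ValueError("relative_increase must be greater than -1")
    return new_tokens / (1 + relative_increase)


# Assumption: the reported 77% increase holds for medium-complexity tasks.
M3_OVER_M27_INCREASE = 0.77

for m3_tokens in (60_000, 70_000):
    baseline = implied_baseline_tokens(m3_tokens, M3_OVER_M27_INCREASE)
    print(f"{m3_tokens:,} M3 tokens ~ {baseline:,.0f} M2.7 tokens")
```

Under that assumption, a task that takes M3 60K to 70K tokens would have taken M2.7 roughly 34K to 40K.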

That token behavior is not hard to reconcile with MiniMax's own docs. In the tool-use guide, the company says M3 supports interleaved thinking and reflects before every tool use, which is exactly the kind of behavior that can turn coding sessions into long traces.

The long-context story is also narrower than the headline suggests. DillonUzar's Context Arena post measured M3 at 39.2% AUC at 128K versus 25.2% for M2.7, with most of the gains concentrated between 8K and 64K. At the 128K mark itself, the two models' scores were nearly tied at 17.4% versus 16.8%. DillonUzar also noted that OpenRouter was exposing only about 524K context at the time, and a follow-up repost added that MiniMax's own API was not yet serving the full 1M context either.

Cheap demos and launch integrations

The happy-path demos looked much better than the repo surgery. testingcatalog's Atomic Chat demo showed M3 reading a hand-drawn sketch, writing the game logic and UI, and shipping a playable HTML platformer in one pass for $0.028.

Cedric Chee's OpenCode tests leaned in the same direction. cedric_chee's voxel pagoda post and cedric_chee's follow-up showed M3 spending about 17 minutes building a detailed voxel pagoda garden scene with lots of backtracking and verification. cedric_chee's CodePen link post also exposed the generated HTML for the scene, which makes the demo easier to inspect than the usual pretty screenshot.

Those examples do not erase the broken-code reports, but they do suggest where M3's early reputation may settle first: cheap multimodal generation, visually impressive one-shot builds, and longer agent traces where cost is low enough to tolerate extra wandering.
