
MiniMax M2.7 ranks #5 on PinchBench at $0.30 per million input tokens

Kilo said MiniMax M2.7 placed fifth on PinchBench, 1.2 points behind Opus 4.6 at a much lower input cost, while community tests showed strong multi-loop agent behavior on graphics tasks. If you route coding-agent traffic by price, M2.7 looks worth a controlled bake-off.


TL;DR

  • Kilo says MiniMax M2.7 now ranks fifth out of 50 models on PinchBench, scoring 86.2% and landing just 1.2 points behind Opus 4.6, while charging $0.30 per million input tokens (Kilo benchmark post).
  • In Kilo's own writeup, M2.7 also posted a 47% pass rate on the 89-task Kilo Bench and showed a "3.7-point improvement" over M2.5 on PinchBench, which Kilo frames as a jump into the top tier of coding models (benchmark details).
  • The practical tradeoff is also clear in Kilo's launch writeup: M2.7 tends to do more reading and analysis before coding, which helped it solve tasks "zero other models could crack" but can also mean more tokens and occasional timeouts (launch thread).
  • Early community tests lean into that agentic pattern rather than one-shot prompting: one developer paired Opus 4.6 as planner/reviewer with four M2.7 workers and a five-iteration loop to build Three.js and voxel-art demos, claiming the run still cost less than Sonnet 4.6 (agent demo, looping test).

What did Kilo actually measure?

Kilo's benchmark writeup says MiniMax M2.7 scored 86.2% on PinchBench, a benchmark for agentic coding tasks, putting it fifth among 50 tested models and 1.2 points behind Opus 4.6. The same post says it outperformed several established competitors while improving 3.7 points over M2.5.

The second benchmark matters more for autonomous coding behavior. In Kilo Bench, an 89-task eval, Kilo reports M2.7 finished second overall with a 47% pass rate and says the model often spends longer exploring a codebase before making changes. Kilo's launch thread claims that pattern let it solve tasks that no other tested model completed, but the writeup also says the same behavior can drive up token use and cause timeouts on harder jobs.

How are practitioners testing it in agent loops?

The first practitioner reports are less about raw benchmark rank and more about orchestration style. In one shared setup, the author says they used "Opus 4.6 as a reviewer/planner" with "4 worker minimax M2.7 agent" instances and a loop count of five to iteratively improve generated scenes (workflow thread). The attached demo shows a Three.js-generated "premium interactive isometric 3D cozy room" built in a single HTML block with repeated refinement (cozy room demo).
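For readers who want to try the general shape of that setup, here is a minimal sketch of a planner/reviewer plus worker loop. It assumes both models are reachable through an OpenAI-compatible endpoint; the model names, prompts, and the `ask` helper are illustrative, not taken from the thread.

```python
# Hypothetical sketch of the planner/reviewer + worker loop described above.
# Model names, base URL, and prompts are illustrative; point the client at
# whatever OpenAI-compatible router you actually use.
from openai import OpenAI

PLANNER_MODEL = "opus-4.6"      # reviewer/planner (assumed routing name)
WORKER_MODEL = "minimax-m2.7"   # cheap worker model (assumed routing name)
N_WORKERS = 4
N_LOOPS = 5

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

def ask(model: str, prompt: str) -> str:
    """Single-turn completion against an OpenAI-compatible endpoint."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def run_loop(task: str) -> str:
    # Planner writes the initial plan once, then reviews each round.
    plan = ask(PLANNER_MODEL, f"Write a short build plan for: {task}")
    best = ""
    for _ in range(N_LOOPS):
        # Fan out: each worker drafts a revision of the current best attempt.
        # (Sequential here for simplicity; a real setup would run workers concurrently.)
        drafts = [
            ask(
                WORKER_MODEL,
                f"Plan:\n{plan}\n\nCurrent attempt:\n{best}\n\n"
                "Improve it. Return a complete, self-contained HTML file.",
            )
            for _ in range(N_WORKERS)
        ]
        # Fan in: the planner/reviewer picks the strongest draft for the next round.
        review_prompt = (
            "Pick the strongest draft below and return it verbatim:\n\n"
            + "\n\n---\n\n".join(drafts)
        )
        best = ask(PLANNER_MODEL, review_prompt)
    return best

if __name__ == "__main__":
    html = run_loop("a premium interactive isometric 3D cozy room in Three.js, single HTML file")
    print(html)
```

The point of the design is that the expensive model is called once per round for planning and review, while the cheap model absorbs the bulk of the token spend across the worker fan-out.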

A follow-up voxel-art test used the same five-loop pattern and the author says to "always use minimax in agentic form," adding that the full run at five loops still cost less than Sonnet 4.6 (cost claim). The evidence here is anecdotal, but it lines up with Kilo's benchmark note that M2.7 appears strongest when given room to explore, review, and iterate rather than being treated as a cheap one-pass coding model (benchmark details).
