Breaking · June 6, 2026

Kilo Code benchmarks MiniMax M3 vs Claude Opus 4.8: 13/17 bugs at $0.07 vs $1.30

A seeded code-audit benchmark found MiniMax M3 and the cheapest Claude Opus 4.8 run each caught 13 of 17 planted bugs, but at sharply different cost. The results also showed models found different bugs, and higher reasoning settings did not reliably improve cost efficiency.

TL;DR

kilocode's benchmark opener said that, in Kilo Code's seeded audit, MiniMax M3 and the cheapest Claude Opus 4.8 run each found 13 of 17 planted bugs, but M3 cost $0.07 versus $1.30 for Claude.
According to kilocode's coverage update and kilocode's reply on totals, Claude Opus 4.8 at xhigh and max led overall coverage with 15 of 17, while M3 matched Claude medium and high on raw count.
As kilocode's follow-up on different bug coverage noted, the 13-bug tie hid different coverage: M3 caught a secret-returning endpoint that Claude medium missed, while Claude medium caught an async transaction bug that M3 missed.
kilocode's reasoning-cost comparison showed that raising Claude's reasoning effort did not pay off proportionally: max cost $3.39, 67% more than xhigh, and found no more bugs.
Kilo Code's full benchmark writeup, Anthropic's Opus 4.8 pricing page, and MiniMax's M3 pricing page all line up with the same basic story: coverage moved by a couple of bugs, while cost moved by more than an order of magnitude ($0.07 for M3 against $1.30 to $3.39 for Claude).

You can read Kilo Code's full writeup, compare Anthropic's Opus 4.8 pricing ($5 per million input tokens, $25 per million output tokens), and check MiniMax's M3 pricing ($0.60 input, $2.40 output). The weirdest detail came from kilocode's follow-up, which clarified that the tied 13-bug runs did not find the same 13 bugs. kilocode's cost breakdown also showed Claude max billing more than xhigh on slightly fewer tokens.

Benchmark setup

Kilo used a production-like webhook delivery service in TypeScript, Bun, and SQLite, with 17 pre-cataloged issues as the answer key, according to kilocode's fixture and prompt and the full benchmark writeup.

Each run got the same audit prompt, no file edits, and a fresh CLI session. Claude Opus 4.8 ran at medium, high, xhigh, and max, while MiniMax M3 ran once at its default setting, per kilocode's fixture and prompt. Kilo says it counted an issue only when the model named it explicitly in its report.
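The counting rule can be sketched as a set intersection against the answer key. This is only an illustration under assumptions: the issue identifiers and the score_report helper below are hypothetical, since Kilo's writeup is described here as having 17 cataloged issues without listing their names or the matching procedure.

```python
# Hypothetical answer key: 17 cataloged issues. The IDs are placeholders,
# not the identifiers Kilo actually used.
ANSWER_KEY = {f"ISSUE-{n:02d}" for n in range(1, 18)}


def score_report(named_issues: set[str]) -> str:
    """Count only issues the report named that are also in the answer key."""
    matched = named_issues & ANSWER_KEY
    return f"{len(matched)} of {len(ANSWER_KEY)}"


# A made-up report: 13 cataloged issues plus one finding outside the key.
report = {f"ISSUE-{n:02d}" for n in range(1, 14)} | {"UNLISTED-FINDING"}
print(score_report(report))  # 13 of 17
```

Under this rule, a finding that is not in the catalog neither helps nor hurts the score, which is one reason a raw count says little about which bugs were found.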

Bug coverage

Every run found the big obvious failures, including missing auth on every route, unsafe outbound requests, a non-constant-time signature check, duplicate-send risk in the worker, and missing idempotency, according to kilocode's blocker list.
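For readers unfamiliar with one item on that list, a non-constant-time signature check, here is a minimal Python sketch of the bug class. It is not code from Kilo's fixture, which is written in TypeScript; the function names and the HMAC-SHA256 scheme are assumptions chosen for illustration.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the webhook body (assumed scheme, for illustration)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_unsafe(secret: bytes, body: bytes, received_sig: str) -> bool:
    # Ordinary string equality may stop at the first differing character,
    # so response time can leak how much of the signature was correct.
    return sign(secret, body) == received_sig


def verify_safe(secret: bytes, body: bytes, received_sig: str) -> bool:
    # hmac.compare_digest is designed to take time independent of
    # where the inputs differ.
    expected = sign(secret, body).encode()
    return hmac.compare_digest(expected, received_sig.encode())
```

Both functions return the same answers; the difference is only in how much timing information an attacker could observe.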

The more interesting split showed up in the narrower bugs. Kilo said M3 still caught the secret-returning endpoint, the combined-filter bug, and the replay-path state bug in its 13 of 17 result, while kilocode's follow-up on different bug coverage added that Claude medium found an async transaction bug that M3 missed.

That is why the headline tie is a little crooked. kilocode's benchmark opener gave both cheap runs a 13, but kilocode's follow-up on different bug coverage said they were not the same 13.
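The point about equal counts hiding different coverage is easy to see with sets. The sketch below uses only the bug labels named in this article and is deliberately partial: it lists the five issues every run found plus the two bugs reported as differing, and it leaves out the issues whose status for Claude medium is not stated.

```python
# Issues the article says every run found.
shared = {
    "missing auth",
    "unsafe outbound requests",
    "non-constant-time signature check",
    "duplicate-send risk",
    "missing idempotency",
}

# Partial finding sets: only the two differing bugs named in the article are added.
m3_found = shared | {"secret-returning endpoint"}
claude_medium_found = shared | {"async transaction bug"}

only_m3 = m3_found - claude_medium_found
only_claude_medium = claude_medium_found - m3_found
combined = m3_found | claude_medium_found

print(len(m3_found) == len(claude_medium_found))  # True: same count
print(only_m3)             # {'secret-returning endpoint'}
print(only_claude_medium)  # {'async transaction bug'}
print(len(combined) > len(m3_found))  # True: the union covers more than either run
```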

Reasoning levels

Claude's higher-effort settings did buy more coverage, just not in a clean staircase. kilocode's coverage update and kilocode's reply on totals put xhigh and max at 15 of 17, ahead of medium, high, and M3 at 13.

But the expensive end looked sloppy on efficiency. kilocode's reasoning-cost comparison said medium and high both caught an async transaction bug that xhigh and max missed, while max cost $3.39, 67% more than xhigh, for no better total. kilocode's cost-per-issue chart framed it the blunt way: MiniMax M3 had the lowest cost per issue, and Claude max had the highest.
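The cost-per-issue framing can be reproduced from the figures quoted in this article. This is a minimal sketch, assuming "67% more than xhigh" means the max run cost 1.67 times the xhigh run; the xhigh cost is derived here and is not a figure quoted from Kilo.

```python
# Costs and bug counts as quoted in the article.
runs = {
    "MiniMax M3": {"cost_usd": 0.07, "bugs_found": 13},
    "Claude Opus 4.8 (cheapest run)": {"cost_usd": 1.30, "bugs_found": 13},
    "Claude Opus 4.8 max": {"cost_usd": 3.39, "bugs_found": 15},
}

# Assumption: max = 1.67 * xhigh, so xhigh is derived, not quoted.
xhigh_cost = round(3.39 / 1.67, 2)
runs["Claude Opus 4.8 xhigh (derived)"] = {"cost_usd": xhigh_cost, "bugs_found": 15}


def cost_per_issue(run: dict) -> float:
    """Dollars spent per planted bug the run found."""
    return run["cost_usd"] / run["bugs_found"]


for name, run in sorted(runs.items(), key=lambda kv: cost_per_issue(kv[1])):
    print(f"{name}: ${cost_per_issue(run):.4f} per issue")
```

On these numbers M3 comes out at roughly half a cent per issue and Claude max at roughly 23 cents, which matches the ordering the chart describes.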

The price gap started before any reasoning setting came into play. Kilo's pricing slide in kilocode's pricing comparison matched the official vendor docs, with Opus 4.8 listed at $5 per million input tokens and $25 per million output tokens on Anthropic's pricing page, versus MiniMax M3 at $0.60 and $2.40 on MiniMax's pay-as-you-go page.
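To show how those list prices turn into a bill, here is a small sketch. The per-million-token prices are the ones quoted above; the token counts in the example are invented for illustration, since per-run token totals are not quoted in this article.

```python
# List prices in USD per million tokens, as quoted in the article.
PRICES_PER_MTOK = {
    "Claude Opus 4.8": {"input": 5.00, "output": 25.00},
    "MiniMax M3": {"input": 0.60, "output": 2.40},
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one run at list price."""
    price = PRICES_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Hypothetical token mix, identical for both models.
claude = run_cost("Claude Opus 4.8", 100_000, 10_000)
m3 = run_cost("MiniMax M3", 100_000, 10_000)
print(f"Claude: ${claude:.3f}, M3: ${m3:.3f}, ratio: {claude / m3:.1f}x")

# Ratios implied by the list prices alone, and by the quoted run costs.
input_ratio = 5.00 / 0.60
output_ratio = 25.00 / 2.40
observed_ratio = 1.30 / 0.07
print(f"{input_ratio:.1f}x input, {output_ratio:.1f}x output, {observed_ratio:.1f}x observed")
```

At identical token counts the list prices alone give a gap of roughly 8x to 10x, while the quoted run costs differ by about 18.6x. That would be consistent with the cheapest Claude run billing more tokens than M3, though this is an inference from the quoted totals rather than something the article states.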

Timing and hosting

MiniMax's edge in this test was price, not speed. The benchmark writeup put M3 at 5m 03s, slower than Claude medium and high but faster than Claude xhigh and max.

Kilo also surfaced a hosting caveat that does not fit in the headline chart. Its writeup said MiniMax plans to release M3 weights publicly, so the reported runtime reflects current hosting rather than some fixed lower bound; if other providers pick it up later, throughput could change.

Inside the Claude runs, the timing curve kept stretching as effort increased. Kilo's writeup put medium at 3m 53s, xhigh at 7m 26s, and max at 9m 24s, about 2.4 times the medium run, while kilocode's reasoning-cost comparison showed the extra wait at max still did not buy a better report than xhigh.
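The runtimes quoted in this section can be compared directly by converting them to seconds. A minimal sketch, using only the durations given above (the run labeled high is omitted because its time is not quoted here):

```python
import re


def to_seconds(text: str) -> int:
    """Convert a duration like '9m 24s' to seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


durations = {
    "Claude medium": "3m 53s",
    "MiniMax M3": "5m 03s",
    "Claude xhigh": "7m 26s",
    "Claude max": "9m 24s",
}

baseline = to_seconds(durations["Claude medium"])
for name, text in durations.items():
    secs = to_seconds(text)
    print(f"{name}: {secs}s, {secs / baseline:.2f}x medium")
```

Relative to Claude medium, M3 took about 1.30 times as long, xhigh about 1.91 times, and max about 2.42 times.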
