AI Primer

Epoch releases MirrorCode with 25 long-horizon SWE tasks and a 56% score

Epoch introduced MirrorCode, a benchmark where models reimplement real programs from specs with no internet and hidden held-out tests; the best current score is 56%. The setup matters because it scales inference into multi-day runs and targets software jobs estimated to take humans weeks.

TL;DR

Epoch's leaderboard and full paper are the primary sources, and Ethan Mollick's post adds a screenshot of the paper that includes the abstract and chart. The oddest detail in the release is probably in Epoch AI's language post: performance barely moved across languages, including Ada.

Benchmark design

MirrorCode is narrower than "general coding agent" evals and more ambitious than short-task bugfix suites. Epoch AI's setup post says the model gets execute-only access to a program, its docs, and tests showing intended behavior, then has to reimplement the whole thing from scratch.

That setup imposes the constraints Epoch keeps emphasizing: execute-only access to the original program, no internet, and scoring against hidden held-out tests.

Epoch's claim is that this makes the benchmark both hard and realistically solvable. Epoch AI's solvability post contrasts MirrorCode with reverse-engineering style evals that can hinge on undocumented behavior, while Epoch AI's benchmark-design post reduces its pitch to three properties: difficult, scalable, and resistant to cheating.
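The execute-only setup described above can be pictured as differential testing: run the original and the reimplementation on the same inputs and compare what comes out. The sketch below is an illustration under stated assumptions, not Epoch's harness. The `Case` shape, the comparison on exit code plus stdout, and the stand-in programs are all hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Case:
    """One behavioral test: command-line args plus stdin text (hypothetical shape)."""
    args: list[str]
    stdin: str = ""


def run(cmd: list[str], case: Case) -> tuple[int, str]:
    """Run a program as a black box and capture its exit code and stdout."""
    proc = subprocess.run(
        cmd + case.args,
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout


def pass_rate(reference: list[str], candidate: list[str], cases: list[Case]) -> float:
    """Fraction of cases where the candidate matches the reference exactly."""
    if not cases:
        raise ValueError("need at least one case")
    matches = sum(run(reference, c) == run(candidate, c) for c in cases)
    return matches / len(cases)


if __name__ == "__main__":
    # Stand-ins for a real original/reimplementation pair (assumption:
    # both are command-line programs that read stdin and write stdout).
    upper = "import sys; print(sys.stdin.read().upper(), end='')"
    reference = [sys.executable, "-c", upper]
    candidate = [sys.executable, "-c", upper]
    cases = [Case(args=[], stdin="hello\n"), Case(args=[], stdin="Ada\n")]
    print(pass_rate(reference, candidate, cases))
```

Comparing only exit code and stdout is a simplification. A real harness would likely also need to handle files written to disk, stderr, and nondeterministic output.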

25 programs, a 56% ceiling, and one standout run

The task set is broad enough to avoid becoming a thin wrapper around web app CRUD. Epoch AI's task list says the 25 targets cover Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression.

The headline number is still far from saturation. Epoch AI's score post says Claude Opus 4.7 leads the benchmark at 56%, and adds that failed runs often still clear 90% or more of tests before breaking on edge cases.
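The gap between a 56% headline and failed runs that still pass 90% or more of tests is easier to see with numbers. A minimal sketch, assuming a task counts as solved only at or above some pass-rate threshold. The article does not state Epoch's exact scoring rule, so `threshold` is a hypothetical parameter, and the pass rates below are made up for illustration, not MirrorCode results.

```python
def solved_fraction(pass_rates: list[float], threshold: float = 1.0) -> float:
    """Share of tasks whose test pass rate meets the (hypothetical) threshold."""
    return sum(r >= threshold for r in pass_rates) / len(pass_rates)


def mean_pass_rate(pass_rates: list[float]) -> float:
    """Average per-task test pass rate, ignoring any solved/unsolved cutoff."""
    return sum(pass_rates) / len(pass_rates)


# Illustrative per-task pass rates (invented for this example).
rates = [1.0, 1.0, 0.95, 0.90]

print(f"solved:    {solved_fraction(rates):.0%}")
print(f"mean pass: {mean_pass_rate(rates):.1%}")
```

Under a strict threshold, two near-misses pull the solved share down to half even though the average pass rate stays above 96%, which is the pattern the score post describes.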

Epoch's showcase example is gotree. Epoch AI's gotree example says Opus 4.7 passed 2,000 of 2,001 tests, or 99.95%, while rebuilding a 16,000-line Go bioinformatics toolkit with more than 40 commands. Epoch estimates that same job would take a human engineer 2 to 17 weeks, versus 14 hours and $251 for the model run.
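The gotree figures can be checked with a few lines of arithmetic. This is a rough sketch: the 40-hour work week is an assumption used to convert Epoch's 2 to 17 week estimate into hours, and the comparison sets human working hours against model wall-clock hours, so the ratios are only indicative.

```python
PASSED, TOTAL = 2000, 2001      # tests passed / tests run, from the gotree example
MODEL_HOURS = 14                # reported run time
MODEL_COST_USD = 251            # reported run cost
HUMAN_WEEKS = (2, 17)           # Epoch's estimate for a human engineer
HOURS_PER_WEEK = 40             # assumption, not from the source

pass_rate = PASSED / TOTAL
human_hours = tuple(w * HOURS_PER_WEEK for w in HUMAN_WEEKS)
time_ratio = tuple(h / MODEL_HOURS for h in human_hours)
cost_per_hour = MODEL_COST_USD / MODEL_HOURS

print(f"pass rate:        {pass_rate:.2%}")
print(f"human hours:      {human_hours[0]} to {human_hours[1]}")
print(f"human/model time: {time_ratio[0]:.1f}x to {time_ratio[1]:.1f}x")
print(f"model cost/hour:  ${cost_per_hour:.2f}")
```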

Mollick's summary mostly tracks the release, but his post is useful because it surfaces the paper screenshots and makes the benchmark's core bet plain: measure the largest end-to-end coding jobs models can finish autonomously, not just whether they can patch a repo in one sitting.

Long runs and bigger budgets

MirrorCode is partly a benchmark design argument about spending. Epoch AI's inference-budget post says many SWE benchmarks cap inference at roughly $1 to $10 per task, even when the underlying work would take a skilled human weeks.

Epoch instead lets runs stretch into multi-day or multi-week territory. The same Epoch AI inference-budget post says one of the longest attempts ran for 19 days and cost $2,600.
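To put that budget next to the $1 to $10 caps mentioned above, a small sketch of the arithmetic. The function names are hypothetical, and the only inputs are the figures quoted from the inference-budget post.

```python
def cost_per_day(total_cost_usd: float, days: float) -> float:
    """Average daily spend over a run."""
    return total_cost_usd / days


def budget_multiple(run_cost_usd: float, cap_usd: float) -> float:
    """How many times larger a run's cost is than a per-task cap."""
    return run_cost_usd / cap_usd


RUN_COST_USD = 2600   # one of the longest attempts
RUN_DAYS = 19
CAP_RANGE_USD = (1, 10)  # typical per-task caps cited for other SWE benchmarks

print(f"daily spend: ${cost_per_day(RUN_COST_USD, RUN_DAYS):.2f}")
for cap in CAP_RANGE_USD:
    print(f"vs ${cap} cap: {budget_multiple(RUN_COST_USD, cap):.0f}x")
```

Even against the top of the usual cap range, the 19-day run spent hundreds of times more per task.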

That changes what the benchmark is actually testing. Short-horizon evals mostly measure whether a model can find the right local move quickly. MirrorCode is trying to measure whether an agent can stay coherent long enough to ship a real replacement program when the search process itself is expensive.

Open release and language coverage

Epoch is not keeping most of the benchmark private. Epoch AI's release post says 22 of the 25 programs are being released as open source, while three stay held out.

That held-out split matters because the benchmark is explicitly built to resist overfitting, and Epoch's leaderboard and paper page is where Epoch says future results will live. The same paper-and-leaderboard post notes that MirrorCode was co-developed with METR and supported by a METR grant.

Epoch also used the benchmark to probe something more specific than the launch headline. Epoch AI's language post says MirrorCode showed little sign that programming language meaningfully changed model performance, even for obscure languages like Ada.
