Epoch releases MirrorCode with 25 long-horizon SWE tasks and a 56% score
Epoch introduced MirrorCode, a benchmark where models reimplement real programs from specs with no internet and hidden held-out tests; the best current score is 56%. The setup matters because it scales inference into multi-day runs and targets software jobs estimated to take humans weeks.

TL;DR
- Epoch AI's new MirrorCode launch thread frames software reimplementation as a long-horizon benchmark, and Epoch AI's score post says the best current headline score is 56%, from Claude Opus 4.7.
- In MirrorCode, models get only a program's behavior, docs, and tests, because Epoch AI's setup post says they must rebuild the software without source access, while Epoch AI's anti-cheating post adds hidden held-out tests and no internet.
- The benchmark spans 25 real programs, because Epoch AI's task list says the targets range from Unix utilities to cryptography, and Epoch AI's release post says 22 of those 25 are being open-sourced.
- Epoch's most vivid example is Epoch AI's gotree example, where Opus 4.7 reimplemented a 16,000 line bioinformatics toolkit in 14 hours for $251, on work Epoch estimates would take a human engineer 2 to 17 weeks.
- MirrorCode also pushes inference budgets far past normal SWE evals, because Epoch AI's inference-budget post says one run lasted 19 days and cost $2,600, while Epoch AI's benchmark-design post argues that scale is necessary to measure current limits.
You can jump from Epoch's leaderboard and full paper to Ethan Mollick's paper screenshot, which includes the abstract and chart, and the oddest detail in the whole release is probably Epoch AI's language post, where performance barely moved across languages, including Ada.
Benchmark design
MirrorCode is narrower than "general coding agent" evals and more ambitious than short-task bugfix suites. Epoch AI's setup post says the model gets execute-only access to a program, its docs, and tests showing intended behavior, then has to reimplement the whole thing from scratch.
That setup creates three constraints Epoch keeps emphasizing:
- No original source code, according to Epoch AI's setup post.
- No internet access and no scorer hacking, according to Epoch AI's anti-cheating post.
- Hidden held-out tests during development, according to Epoch AI's anti-cheating post.
Epoch's claim is that this makes the benchmark both hard and realistically solvable. Epoch AI's solvability post contrasts MirrorCode with reverse-engineering style evals that can hinge on undocumented behavior, while Epoch AI's benchmark-design post reduces its pitch to three properties: difficult, scalable, and resistant to cheating.
25 programs, a 56% ceiling, and one standout run
The task set is broad enough to avoid becoming a thin wrapper around web app CRUD. Epoch AI's task list says the 25 targets cover Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression.
The headline number is still far from saturation. Epoch AI's score post says Claude Opus 4.7 leads the benchmark at 56%, and adds that failed runs often still clear 90% or more of tests before breaking on edge cases.
Epoch's showcase example is gotree. Epoch AI's gotree example says Opus 4.7 passed 2,000 of 2,001 tests, or 99.95%, while rebuilding a 16,000 line Go bioinformatics toolkit with more than 40 commands. Epoch estimates that same job would take a human engineer 2 to 17 weeks, versus 14 hours and $251 for the model run.
Mollick's summary mostly tracks the release, but his post is useful because it surfaces the paper screenshots and makes the benchmark's core bet plain: measure the largest end-to-end coding jobs models can finish autonomously, not just whether they can patch a repo in one sitting.
Long runs and bigger budgets
MirrorCode is partly a benchmark design argument about spending. Epoch AI's inference-budget post says many SWE benchmarks cap inference at roughly $1 to $10 per task, even when the underlying work would take a skilled human weeks.
Epoch instead lets runs stretch into multi-day or multi-week territory. The same Epoch AI inference-budget post says one of the longest attempts ran for 19 days and cost $2,600.
That changes what the benchmark is actually testing. Short-horizon evals mostly measure whether a model can find the right local move quickly. MirrorCode is trying to measure whether an agent can stay coherent long enough to ship a real replacement program when the search process itself is expensive.
Open release and language coverage
Epoch is not keeping most of the benchmark private. Epoch AI's release post says 22 of the 25 programs are being released as open source, while three stay held out.
That held-out split matters because the benchmark is explicitly built around anti-overfitting, and Epoch's leaderboard and paper page is where Epoch says future results will live. The same Epoch AI's paper-and-leaderboard post notes MirrorCode was co-developed with METR and supported by a METR grant.
Epoch also used the benchmark to probe something more specific than the launch headline. Epoch AI's language post says MirrorCode showed little sign that programming language meaningfully changed model performance, even for obscure languages like Ada.