breakingApril 10, 2026

MirrorCode benchmarks Claude Opus 4.6 on a 16,000-line software reimplementation

Epoch AI and METR introduced MirrorCode, a long-horizon benchmark where models reimplement software from execution-only access; Opus 4.6 completed a 16,000-line bioinformatics toolkit. The authors say oracle tests and memorization risks still limit how directly the result maps to everyday software work.

5 min read

MirrorCode benchmarks Claude Opus 4.6 on a 16,000-line software reimplementation

TL;DR

Epoch AI's launch thread introduced MirrorCode as a long-horizon benchmark for software reimplementation, and MirrorCode's preliminary results say Claude Opus 4.6 fully rebuilt gotree, a bioinformatics CLI with about 16,000 lines of Go and more than 40 commands.
In Epoch AI's task description, agents get execute-only access to the original program plus visible tests, while the official post adds that they also get high-level docs but no source code or internet access.
Epoch AI's gotree result estimates the successful reimplementation would take an unassisted human engineer 2 to 17 weeks, and the blog post says Opus 4.6 solved almost every target up to gotree's size in the current suite.
Epoch AI's Pkl note says the hardest task, reimplementing Pkl, was still improving when the run hit its token limit, and the methodology section says the team explored budgets up to 1 billion tokens, about $550 per task in its setup.
Epoch AI's caveat tweet and its memorization warning both narrow the claim: MirrorCode measures work against a precise oracle-style specification, and the authors say memorization defenses are imperfect.

You can read the full preliminary writeup, inspect the target project gotree, and compare MirrorCode's "weeks-long" framing with METR's separate time horizons work. The fun detail is how benchmark-y this setup gets: visible tests, held-out dual tests to catch hard-coding, no web access, and a 1 billion token ceiling that still was not enough to finish Pkl.

MirrorCode

MirrorCode is built around one concrete task: reimplement an existing CLI program exactly enough to match the original program's behavior. According to Epoch AI's setup tweet, the agent can run the original binary and see visible test cases, but cannot read the source.

The official methodology fills in the rest of the scaffold. Agents also get high-level documentation, run inside a Docker sandbox, use a simple ReAct-style scaffold from Inspect, and are blocked from internet access and from wrapping the original binary.

Epoch says the full benchmark has more than 20 targets across Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression. The post says MirrorCode will be released as open source, with a private test set held back.

gotree

The headline result is gotree, an open source bioinformatics toolkit for phylogenetic trees. Epoch AI's result tweet says Opus 4.6 reimplemented it successfully, and the blog post puts the original size at 16,905 lines of Go, with 2,001 end-to-end tests.

The same table says Opus 4.6's successful Rust version came in at 7,644 lines. That matters mostly as a complexity marker, because Epoch uses the reimplementation's line count to compare tasks across target languages.

MirrorCode's current analyzed set has four named programs:

choose, a string manipulation tool similar to cut or awk
cal, the terminal calendar utility
gotree, the solved large bioinformatics target
Pkl, the unsolved configuration language target

Epoch says smaller targets like choose and cal were solvable by older Claude models, while larger ones like gotree only fell to more recent models.

Pkl and the token wall

MirrorCode is more interesting where it breaks than where it passes. Epoch AI's thread says Pkl was still making progress when the experiment stopped, and the official post identifies Pkl as a Java and Kotlin codebase with 61,461 original lines and 770 end-to-end tests.

The methodology section says runs used compaction so trajectories could outlive context limits, and that the team has explored budgets up to 1 billion tokens per task. In their setup, that worked out to roughly $550 for a single run.

Epoch also says older models had a habit of submitting too early, even when tests were still failing. That makes MirrorCode feel less like a unit-test benchmark and more like a perseverance benchmark, which is catnip for anyone tracking where coding agents actually fail.

Caveats in the oracle setup

Epoch is unusually direct about the benchmark's blind spots. One caveat tweet says real software rarely comes with an oracle program and pre-existing tests, which is the whole trick that makes MirrorCode cleanly measurable.

The methodology section shows how the team tried to keep the benchmark from turning into test-set memorization or output hard-coding. For some exposed tests, it pairs hidden "dual" tests with different values, so a model that hard-codes February 1983 for cal still fails on a different year.

Epoch AI's requirements note argues that parts of software engineering already do look like this, because specs, metrics, and tests can give agents a crisp feedback loop. Its memorization note adds the other unresolved problem: filtering memorized target programs helps, but does not close the case.

Release plan and scope

These are explicitly preliminary numbers. Epoch AI's final thread post says the team is still running larger experiments and adding other models for the full release.

The official post says 24 target programs have already been manually selected, even though this writeup only analyzes four of them in detail. It also says early testing found other models were comparable or weaker on the tasks tried so far.

Epoch AI's credits post says MirrorCode is led by Tamay Kadamcz at Epoch AI, in collaboration with and funded by METR, with Kadamcz and David Rein listed as core contributors. That funding and benchmark lineage are useful context on their own, because MirrorCode is arriving as part of a broader METR and Epoch push to measure longer autonomous task horizons rather than one-shot coding demos.