Epoch AI and METR introduced MirrorCode, a long-horizon benchmark where models reimplement software from execution-only access; Opus 4.6 completed a 16,000-line bioinformatics toolkit. The authors say oracle tests and memorization risks still limit how directly the result maps to everyday software work.

You can read the full preliminary writeup, inspect the target project gotree, and compare MirrorCode's "weeks-long" framing with METR's separate time horizons work. The fun detail is how benchmark-y this setup gets: visible tests, held-out dual tests to catch hard-coding, no web access, and a 1 billion token ceiling that still was not enough to finish Pkl.
MirrorCode is built around one concrete task: reimplement an existing CLI program exactly enough to match the original program's behavior. According to Epoch AI's setup tweet, the agent can run the original binary and see visible test cases, but cannot read the source.
The official methodology fills in the rest of the scaffold. Agents also get high-level documentation, run inside a Docker sandbox, use a simple ReAct-style scaffold from Inspect, and are blocked from internet access and from wrapping the original binary.
Epoch says the full benchmark has more than 20 targets across Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression. The post says MirrorCode will be released as open source, with a private test set held back.
The headline result is gotree, an open source bioinformatics toolkit for phylogenetic trees. Epoch AI's result tweet says Opus 4.6 reimplemented it successfully, and the blog post puts the original size at 16,905 lines of Go, with 2,001 end-to-end tests.
The same table says Opus 4.6's successful Rust version came in at 7,644 lines. That matters mostly as a complexity marker, because Epoch uses the reimplementation's line count to compare tasks across target languages.
MirrorCode's current analyzed set has four named programs:
Epoch says smaller targets like choose and cal were solvable by older Claude models, while larger ones like gotree only fell to more recent models.
MirrorCode is more interesting where it breaks than where it passes. Epoch AI's thread says Pkl was still making progress when the experiment stopped, and the official post identifies Pkl as a Java and Kotlin codebase with 61,461 original lines and 770 end-to-end tests.
The methodology section says runs used compaction so trajectories could outlive context limits, and that the team has explored budgets up to 1 billion tokens per task. In their setup, that worked out to roughly $550 for a single run.
Epoch also says older models had a habit of submitting too early, even when tests were still failing. That makes MirrorCode feel less like a unit-test benchmark and more like a perseverance benchmark, which is catnip for anyone tracking where coding agents actually fail.
Epoch is unusually direct about the benchmark's blind spots. One caveat tweet says real software rarely comes with an oracle program and pre-existing tests, which is the whole trick that makes MirrorCode cleanly measurable.
The methodology section shows how the team tried to keep the benchmark from turning into test-set memorization or output hard-coding. For some exposed tests, it pairs hidden "dual" tests with different values, so a model that hard-codes February 1983 for cal still fails on a different year.
Epoch AI's requirements note argues that parts of software engineering already do look like this, because specs, metrics, and tests can give agents a crisp feedback loop. Its memorization note adds the other unresolved problem: filtering memorized target programs helps, but does not close the case.
These are explicitly preliminary numbers. Epoch AI's final thread post says the team is still running larger experiments and adding other models for the full release.
The official post says 24 target programs have already been manually selected, even though this writeup only analyzes four of them in detail. It also says early testing found other models were comparable or weaker on the tasks tried so far.
Epoch AI's credits post says MirrorCode is led by Tamay Kadamcz at Epoch AI, in collaboration with and funded by METR, with Kadamcz and David Rein listed as core contributors. That funding and benchmark lineage are useful context on their own, because MirrorCode is arriving as part of a broader METR and Epoch push to measure longer autonomous task horizons rather than one-shot coding demos.