METR says Claude Mythos Preview hits 16-hour p50 horizon in early snapshot
METR said an early Claude Mythos Preview snapshot reached at least a 16-hour 50% time horizon, with only five in-suite tasks at or beyond that duration. The result matters because it lands beyond METR's stable measurement band, so cross-model comparisons are less reliable.

TL;DR
- METR_Evals' headline result says an early Claude Mythos Preview snapshot reached a 50% task-completion time horizon of at least 16 hours, with a 95% confidence interval from 8.5 hours to 55 hours, during a limited March 2026 risk-assessment window.
- According to METR_Evals' caveat post, that result sits at the top of METR's current measurement range, because only 5 of 228 tasks in the suite are estimated at 16 hours or longer.
- On METR's stricter 80% success line, scaling01's chart post put Mythos Preview at about 3 hours 6 minutes, while alexalbert__'s comparison post said the early snapshot was more than 2x the next-best model on that view.
- The most concrete public proof point still comes from Firefox: daniel_mac8's Mozilla link post pointed to Mozilla's writeup saying Claude Mythos Preview contributed to 271 fixed vulnerabilities in Firefox 150, part of a release cycle with 423 security fixes documented in Mozilla's post.
METR's own chart is doing two jobs at once. It shows a very strong result for Mythos, and it also shows the eval hitting its ceiling. You can read METR's methodology page, check Mozilla's behind-the-scenes Firefox writeup, and skim the Hacker News thread where Mozilla engineers clarified what counted as a vulnerability and when Mythos produced proof-of-concept cases.
The 16-hour ceiling
On its time horizons page, METR defines task-completion time horizon as the human task duration at which an AI agent is predicted to succeed at a given reliability level, based on a logistic fit over software-heavy tasks. In this release, METR_Evals' headline result put Claude Mythos Preview at 16 hours or more on the 50% line.
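To make the mechanics concrete, here is a minimal sketch of that kind of estimate, assuming a logistic model of success probability against log task duration. The task durations and outcomes are invented for illustration, and METR's actual estimator and weighting may differ; note that the same fitted curve yields both the 50% line here and the 80% line discussed below.

```python
# Minimal sketch of a time-horizon estimate: fit a logistic curve of
# success probability against log(human task duration), then invert it
# at a chosen reliability level. All data here is made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.1, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64])  # human time per task
solved = np.array([1, 1, 1, 1, 1, 0, 1, 0, 1, 0])           # agent outcome

# P(success | t) = sigmoid(a + b * log t); large C approximates an
# unregularized maximum-likelihood fit.
clf = LogisticRegression(C=1e6).fit(np.log(hours).reshape(-1, 1), solved)
a, b = clf.intercept_[0], clf.coef_[0, 0]

def horizon(p: float) -> float:
    # Invert the curve: a + b*log(t) = logit(p)  =>  t = exp((logit(p) - a) / b)
    return float(np.exp((np.log(p / (1 - p)) - a) / b))

print(f"p50 horizon: {horizon(0.5):.1f} h, p80 horizon: {horizon(0.8):.1f} h")
# With b < 0, the p80 horizon sits a fixed factor 4**(1/b) below the p50 one.
```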
The weird bit is that 16 hours is not a clean new benchmark number. It is the point where METR starts truncating the headline because the current suite does not have enough long-duration tasks above that band.
The 3-hour 80% line
At 80% success, the result is easier to compare. scaling01's chart post read the point at about 3 hours 6 minutes, and alexalbert__'s comparison post said that early snapshot was more than 2x the next-best model on METR's 80% chart.
METR's public methodology page explains why the 80% curve matters alongside the 50% curve: the same task suite supports both views, but the higher-reliability line avoids some of the saturation effect now showing up at the top end. That is why the 3-hour number is the cleaner cross-model comparison, even though the 16-hour headline is the attention grabber.
The measurement caveat
According to METR_Evals' caution post, only 5 of 228 tasks in the current suite are estimated as 16 hours or longer. METR_Evals' robustness clarification adds that the suite could still distinguish a much more capable model from current public state of the art, but not robustly enough for precise quantitative comparisons or extrapolations.
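One way to see why the interval is so wide at the top: with only a handful of long tasks, resampling the suite swings the upper tail hard. The sketch below uses a plain task-level percentile bootstrap on invented data; treating this as METR's exact resampling scheme is an assumption, not something the posts confirm.

```python
# Sketch: bootstrap the 50% horizon over a made-up task suite to see how
# a thin top end inflates the confidence interval. Data is invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.1, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64])
solved = np.array([1, 1, 1, 1, 1, 0, 1, 0, 1, 0])

def p50_horizon(h: np.ndarray, y: np.ndarray) -> float:
    clf = LogisticRegression(C=1e6).fit(np.log(h).reshape(-1, 1), y)
    # p50 solves a + b*log(t) = 0  =>  t = exp(-a / b)
    return float(np.exp(-clf.intercept_[0] / clf.coef_[0, 0]))

rng = np.random.default_rng(0)
draws = []
while len(draws) < 2000:
    idx = rng.integers(0, len(hours), len(hours))  # resample tasks
    if solved[idx].min() == solved[idx].max():
        continue  # all-success or all-failure resample: fit is undefined
    draws.append(p50_horizon(hours[idx], solved[idx]))

lo, hi = np.percentile(draws, [2.5, 97.5])
print(f"bootstrap 95% CI for the p50 horizon: {lo:.1f} h to {hi:.1f} h")
# Resamples that drop the few long tasks push the upper bound around,
# which is the shape of METR's 8.5-hour-to-55-hour interval in miniature.
```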
That caveat lines up with METR's broader documentation. The current Time Horizon 1.1 suite on METR's methodology page draws from RE-Bench, HCAST, and shorter novel software tasks across software engineering, ML, and cybersecurity, with human completion times used to anchor task duration. It is a real benchmark, but it is also a benchmark that now needs longer tasks.
One unresolved question from the tweet cycle was whether METR had already extended the suite for Mythos. The answer appears to be no: METR_Evals' caution post says updated methods are still in development, so the public result is a ceiling claim, not a fully expanded next-generation measurement.
Firefox hardening
The best public grounding for Mythos capability is still Mozilla's own reporting, not the extrapolation memes. In Mozilla's writeup, the company said Claude Mythos Preview contributed to fixes for 271 vulnerabilities in Firefox 150 after Mozilla built its own harness on top of existing fuzzing infrastructure.
Mozilla's details are much more useful than the viral chart reactions. The post says earlier experiments with GPT-4 and Sonnet 3.5 produced too many false positives to scale, while agentic harnesses changed the process by creating and running reproducible test cases, then parallelizing hunts across ephemeral VMs by target file.
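As a mental model only, the loop Mozilla describes might look like the sketch below. Every name in it (spawn_vm, run_fuzzer, ask_model_for_poc, reproduces) is a hypothetical placeholder standing in for infrastructure Mozilla has not published, not real Mozilla code or a real Anthropic API.

```python
# Hypothetical shape of an agentic fuzzing harness: one ephemeral VM per
# target file, with the model asked to turn raw crashes into reproducible
# test cases before anything enters the triage queue. All helpers below
# are invented stand-ins for unpublished infrastructure.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class VM:
    """Placeholder for a throwaway sandbox scoped to one target file."""
    target: str
    def destroy(self) -> None:  # ephemeral: nothing persists between hunts
        pass

def spawn_vm(target: str) -> VM:           # invented stand-in
    return VM(target)

def run_fuzzer(vm: VM, target: str):       # invented stand-in: yields crashes
    return []

def ask_model_for_poc(crash) -> str:       # invented stand-in: model drafts PoC
    return "poc.html"

def reproduces(vm: VM, poc: str) -> bool:  # invented stand-in: replay the PoC
    return False

def hunt(target: str) -> list[dict]:
    vm = spawn_vm(target)
    findings = []
    try:
        for crash in run_fuzzer(vm, target):
            poc = ask_model_for_poc(crash)  # reproducible case, not a raw crash
            if reproduces(vm, poc):         # filter out non-reproducing hits
                findings.append({"file": target, "poc": poc})
    finally:
        vm.destroy()
    return findings

targets = ["dom/xslt", "js/src/jit", "netwerk/dns"]  # per-file work items
with ThreadPoolExecutor(max_workers=32) as pool:
    reports = [f for batch in pool.map(hunt, targets) for f in batch]
print(f"{len(reports)} reproducible findings")
```

The reproduce-before-report step is the design choice doing the work: it converts a noisy crash stream into the kind of signal that, per Mozilla, the earlier GPT-4 and Sonnet 3.5 experiments could not deliver.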
A few concrete claims from Mozilla's post:
- The April 2026 release cycle shipped 423 Firefox security fixes, a spike also visible in akbirkhan's Firefox fixes chart.
- Mozilla published a sample of bugs spanning JIT, WebAssembly GC, IPC races, sandbox escapes, XSLT, DNS parsing, and RLBox verification logic.
- Several sample findings involved long-lived bugs, including a 15-year-old element bug and a 20-year-old XSLT issue.
- In the Hacker News discussion, Mozilla engineers added that Mythos wrote PoCs for bugs that crashed with memory-unsafe behavior, while the team often fixed likely exploitable issues without spending extra time proving full exploitability.
That last clarification matters because it adds a layer the METR chart cannot show: Mozilla is describing a harness that generates signal strong enough to plug directly into a live security pipeline, not just a benchmark run or a one-off demo.