ProgramBench reports 0% on FFmpeg, SQLite, and ripgrep rebuilds without internet
The SWE-Bench team released ProgramBench, which asks models to rebuild real software from executables alone, and the initial complete-pass score is 0% across models. It matters as a harsher long-horizon coding benchmark, though its all-tests-pass metric and simpler harness make it a stress test rather than a direct proxy for production agents.

TL;DR
- deedydas' launch thread surfaced a new benchmark, ProgramBench, where agents get only an executable plus docs and must rebuild the whole program without internet; the official ProgramBench site says the first release has 200 tasks.
- According to the paper screenshot in deedydas' thread, every listed model scored 0.0% on the primary "resolved" metric, which counts a task only when all hidden behavioral tests pass.
- The same result-table screenshot also shows the 0% headline is harsher than the heatmap beneath it: Claude Opus 4.7 reached 3.0% "almost resolved," meaning at least 95% of tests passed, while several models still landed at 0.0% even there.
- Critiques landed fast, with scaling01's post arguing that the 100%-or-bust metric hides meaningful progress and OfirPress's reply countering that the benchmark is intentionally optimized for full completion because the reference binaries pass every test.
- One important caveat came from paul_cal's critique and OfirPress's reply: the initial run used mini-SWE-agent, not Codex or Claude Code style harnesses, and the authors say multi-agent baselines are still coming.
You can browse the leaderboard, inspect the new GitHub repo, and check the odd little detail that the site already exposes an "almost resolved" column plus an extended-results view even while every top-line score reads 0%. The paper screenshot in deedydas' thread also claims the tasks span everything from compact CLI tools to FFmpeg, SQLite, and the PHP interpreter. Then the discourse immediately split into two useful arguments: whether 0% is the right headline, and whether a stripped-down harness is really the right way to bench frontier coding agents.
ProgramBench
ProgramBench is a clean-room rebuild test, not a bugfix eval. The repo README says agents get a compiled binary and its documentation, then must architect and implement a complete codebase that reproduces the original behavior; the new facebookresearch/programbench repository links out to the website, paper, leaderboard, and usage guide.
The official site says the public leaderboard uses mini-SWE-agent over 200 tasks and scores submissions with hidden behavioral tests. The paper abstract shown in the screenshot adds one more useful detail: the test suites were generated via agent-driven fuzzing, so the eval can check end-to-end behavior without prescribing implementation structure.
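To make that setup concrete, here is a minimal sketch of what a differential behavioral test of this kind could look like: replay a fuzzed input against both the shipped binary and the agent's rebuild, then compare what each one does. This is an illustration under stated assumptions, not ProgramBench's actual harness; the paths, the timeout, and the exit-code-plus-stdout comparison rule are all invented for the example.

```python
import subprocess

# Hypothetical differential behavioral test. Paths, timeout, and the
# comparison rule are assumptions for illustration, not the real harness.
REFERENCE_BIN = "./reference/tool"  # the shipped executable (assumed path)
CANDIDATE_BIN = "./rebuilt/tool"    # the agent's rebuilt program (assumed path)


def observe(binary, args, stdin_data, timeout=10):
    """Run a binary on one fuzzed input and record its observable behavior."""
    try:
        proc = subprocess.run(
            [binary, *args],
            input=stdin_data,
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # a hang is itself an observable behavior
    return proc.returncode, proc.stdout


def behaviors_match(case):
    """Pass when the rebuild reproduces the reference's exit code and
    stdout on this input; implementation structure is never inspected."""
    ref = observe(REFERENCE_BIN, case["args"], case["stdin"])
    cand = observe(CANDIDATE_BIN, case["args"], case["stdin"])
    return ref is not None and ref == cand


# Fuzzed cases would be generated ahead of time by an agent; these two
# placeholders stand in for that hidden suite.
cases = [
    {"args": ["--count"], "stdin": b"alpha\nbeta\n"},
    {"args": ["--reverse"], "stdin": b"gamma\n"},
]
passed = sum(behaviors_match(c) for c in cases)
print(f"{passed}/{len(cases)} behavioral tests passed")
```

Because the check only ever sees inputs and outputs, an agent is free to structure the rebuild however it likes, which is exactly the property the abstract describes.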
The 0% scoreboard
The benchmark's main number is brutally simple. On the official leaderboard, "resolved" means every hidden behavioral test passed for a task, and the first public table shows 0% for every model.
The paper screenshot in deedydas' thread lists nine evaluated models and puts Claude Opus 4.7 at the top of the "almost resolved" column with 3.0%, ahead of Opus 4.6 at 2.5% and Sonnet 4.6 at 1.6%. GPT 5.4, GPT 5.4 mini, GPT 5 mini, Gemini 3.1 Pro, Gemini 3 Flash, and Claude Haiku 4.5 all show 0.0% on that looser measure in the screenshot.
The paper abstract also captures the most interesting failure mode in one sentence: models tend to produce monolithic, single-file implementations that diverge sharply from the human-written reference code. For an eval aimed at long-horizon software synthesis, that is a more revealing detail than the all-zero headline.
Almost resolved
The site does not actually hide partial progress; it just demotes it. The leaderboard defines "almost resolved" as passing at least 95% of a task's behavioral tests, and OfirPress's reply says they gave it a full column because they are not trying to conceal it.
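The arithmetic behind the two columns is simple enough to sketch. The 100% and 95% thresholds come straight from the site's definitions; the function name and input format below are assumptions made for the example.

```python
# Per-task scoring as the leaderboard describes it: "resolved" needs every
# hidden behavioral test to pass, "almost resolved" needs at least 95%.
# The function name and boolean-list input are illustrative assumptions.
def task_status(test_results, almost_threshold=0.95):
    """test_results: one boolean per hidden behavioral test."""
    frac = sum(test_results) / len(test_results)
    if frac == 1.0:
        return "resolved"
    if frac >= almost_threshold:
        return "almost resolved"
    return "unresolved"


# A task passing 96 of 100 tests shows up in the "almost resolved" column
# but contributes nothing to the headline "resolved" number.
print(task_status([True] * 96 + [False] * 4))  # -> almost resolved
```

That gap between 96/100 and 100/100 is precisely the territory the two sides are arguing over.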
The disagreement is about what the benchmark should optimize for:
- scaling01's critique argues the 0% headline is not very informative when Opus-class models can still pass large chunks of each task's tests.
- OfirPress's reply says the benchmark should anchor on 100% completion because the reference binaries pass all tests.
- deedydas' follow-up argues this is still a useful stress test even if people also want progress-based rankings.
That tension feels real. ProgramBench looks strongest as a ceiling-seeking benchmark for end-to-end completion, and weaker as a direct proxy for whether an agent was useful during a messy real coding session.
Harness gaps
The other live argument is about the agent harness, not the models. paul_cal's post called out the absence of Codex and Claude Code style baselines, plus the lack of context management or compaction in the initial setup.
So far, the list of concrete facts is short:
- OfirPress's reply says Claude Code, Codex, and mini-SWE-agent get roughly similar scores on TerminalBench and SWE-Bench.
- The same reply says "this is not a harness issue" for those existing benchmarks.
- OfirPress's reply also says multi-agent baselines are coming soon.
- deedydas' later thread treats the missing harness comparisons as fair criticism, but argues they can simply be added.
That leaves ProgramBench in an interesting early state: the benchmark shipped with a hard, concrete task design and a deliberately unforgiving completion metric, but the strongest argument against over-reading the first scoreboard is that the frontier agent setups people actually care about have not been run on it yet.