Benchmark for program synthesis
Meta research benchmark for evaluating program synthesis and code-generation models.
The SWE-Bench team released ProgramBench, a benchmark that asks models to rebuild real software from executables alone; in the initial results, every model scores 0% on the complete-pass metric. It matters as a harsher long-horizon coding benchmark, though its all-tests-pass metric and simpler harness make it a stress test rather than a direct proxy for production agents.
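To illustrate why an all-tests-pass metric can read 0% even when models make partial progress, here is a minimal Python sketch. It is not ProgramBench's actual scoring code: the function names, the result layout (task name mapped to a list of per-test outcomes), and the sample data are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical layout: task name -> list of per-test outcomes (True = passed).
Results = Dict[str, List[bool]]


def complete_pass_rate(results: Results) -> float:
    """Fraction of tasks where every test passed (all-or-nothing scoring)."""
    if not results:
        return 0.0
    solved = sum(1 for tests in results.values() if tests and all(tests))
    return solved / len(results)


def mean_test_pass_rate(results: Results) -> float:
    """Average per-task fraction of passing tests (partial-credit scoring)."""
    if not results:
        return 0.0
    fractions = [
        sum(tests) / len(tests) if tests else 0.0
        for tests in results.values()
    ]
    return sum(fractions) / len(fractions)


# Made-up outcomes, not real benchmark data.
example: Results = {
    "task_a": [True, True, True, False],
    "task_b": [True, False, True, True, True],
}

print("complete-pass rate:", complete_pass_rate(example))
print("mean test pass rate:", mean_test_pass_rate(example))
```

On this made-up data, the complete-pass rate is zero even though most individual tests pass, which is the sense in which an all-or-nothing metric acts as a stress test rather than a graded measure of progress.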