Release: June 25, 2026

DeepReinforce releases Ornith-1.0 397B MoE with 82.4 SWE-Bench Verified

DeepReinforce released Ornith-1.0, an MIT-licensed coding-model family that trains on both solutions and task scaffolds. The flagship 397B MoE claims 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, pushing open coding models closer to closed frontier systems.

TL;DR

DeepReinforce shipped Ornith-1.0 as an MIT-licensed coding-model family, with a 397B MoE flagship and a 9B dense model, according to testingcatalog's launch post and rohanpaul_ai's release summary.
The main technical twist is joint reinforcement learning over both the code answer and the task scaffold, where the model first proposes a better workflow and then solves the task with it, as described in testingcatalog's training thread and rohanpaul_ai's follow-up.
DeepReinforce's reported headline numbers put Ornith-1.0-397B at 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, ahead of Claude Opus 4.7 on both charts in ai_for_success's benchmark post and rohanpaul_ai's benchmark summary.
The smaller story is the 9B model: testingcatalog's thread says it reaches 69.4 on SWE-Bench Verified and 43.1 on Terminal-Bench 2.1, which is unusually high for a compact open coding model.

Both the official write-up and the Hugging Face weights are available, and the benchmark chart in ai_for_success's post covers more than the two headline scores: it adds SWE-Bench Pro, SWE-Bench Multilingual, NL2Repo, Claw-eval, and two SWE Atlas slices.

Scaffold RL

Ornith's claim to novelty is not just more RL on coding tasks. The model is trained to improve the scaffold around a task, then produce the solution with that scaffold, with reward flowing back through both stages.
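
One way to read "reward flowing back through both stages" is as a single terminal reward that weights the log-probabilities of both the scaffold proposal and the solution. The sketch below is a generic REINFORCE-with-baseline loss, not DeepReinforce's published objective; `Rollout`, `joint_loss`, and the mean-reward baseline are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class Rollout:
    scaffold_logprob: float  # log-prob of the proposed scaffold (stage 1)
    solution_logprob: float  # log-prob of the solution given that scaffold (stage 2)
    reward: float            # terminal task reward, e.g. 1.0 if the tests pass


def joint_loss(rollouts: list[Rollout]) -> float:
    """REINFORCE-style loss where one reward weights both stages' log-probs."""
    # Assumed baseline: the mean reward over the batch of rollouts.
    baseline = fmean(r.reward for r in rollouts)
    return -fmean(
        (r.reward - baseline) * (r.scaffold_logprob + r.solution_logprob)
        for r in rollouts
    )
```

If every rollout in a batch earns the same reward, the advantage is zero and the loss vanishes, so in this sketch only differences between scaffold-plus-solution attempts drive the update.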

In practice, the scaffold covers the agent behaviors that usually live outside the base model: planning, memory patterns, routing, tool rhythm, retry logic, error handling, and search process, according to rohanpaul_ai's release summary and follow-up.
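
To make that list concrete, a scaffold can be pictured as a small settings object that an agent loop consults. This is a hypothetical illustration only: the `Scaffold` fields, their defaults, and `run_step` are invented here and do not describe Ornith's actual scaffold format, which the cited posts do not specify.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scaffold:
    """Hypothetical scaffold settings; field names are illustrative only."""
    plan_first: bool = True     # planning: draft a plan before editing
    memory_window: int = 8      # memory pattern: past steps kept in context
    max_retries: int = 2        # retry logic: extra attempts after a failure
    tools: tuple = ("search", "edit", "run_tests")  # routing: tool order to try


def run_step(action: Callable[[], str], scaffold: Scaffold) -> str:
    """Run one agent step, retrying on errors as often as the scaffold allows."""
    attempts = scaffold.max_retries + 1
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except RuntimeError as err:  # error handling: remember and retry
            last_error = err
    raise RuntimeError(f"step failed after {attempts} attempts") from last_error
```

In a setup like this, a fixed harness hard-codes values such as `max_retries`; the article's point is that Ornith is trained to propose those choices itself.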

That makes Ornith a direct shot at one of the usual open-model bottlenecks in agentic coding. Instead of treating the harness as fixed and only tuning the answerer, DeepReinforce is training the model to discover a better harness for each class of repo or terminal task.

Benchmark table

The flagship chart attached to ai_for_success's benchmark post lists eight evals for Ornith-1.0-397B:

Terminal-Bench 2.1: 77.5
SWE-Bench Verified: 82.4
SWE-Bench Pro: 62.2
SWE-Bench Multilingual: 78.9
NL2Repo: 48.2
Claw-eval Avg: 77.1
SWE Atlas, QnA: 41.2
SWE Atlas, TW: 39.1

The attention grabber is the first two rows. rohanpaul_ai's benchmark summary says the model beats Claude Opus 4.7 on both Terminal-Bench 2.1 and SWE-Bench Verified, which is why this release immediately landed in the "open models are catching up" discourse.

Those numbers are still vendor-reported. What makes them interesting is the spread, not just the headline. DeepReinforce is presenting Ornith as a model family tuned for agentic coding across patching, multilingual code work, repo navigation, and tool-using terminal tasks, not as a one-benchmark stunt.

9B dense model

The 9B dense variant is the other reason this launch matters. testingcatalog's training thread says it posts 69.4 on SWE-Bench Verified and 43.1 on Terminal-Bench 2.1, while rohanpaul_ai's release summary frames that as competitive with much larger open models.

That is a different product bet from the 397B MoE. The flagship is for frontier-level open coding claims. The 9B is for people who want the same training recipe in a model small enough to make edge deployment part of the pitch.

Weights and lineage

DeepReinforce says Ornith is fully open source under MIT, with weights available through Hugging Face and the method described in the official blog post.

The release is also not a from-scratch base model effort. rohanpaul_ai's release summary says Ornith is built on top of pretrained Gemma 4 and Qwen 3.5, which places the novelty squarely in the post-training recipe rather than in a new foundation model stack.