TOPIC · 16 stories

Reinforcement Learning

RL, RFT, and environment-driven training for agent behavior.

Stories

DeepReinforce releases Ornith-1.0 397B MoE with 82.4 SWE-Bench Verified

DeepReinforce released Ornith-1.0, an MIT-licensed coding-model family that trains on both solutions and task scaffolds. The company reports that the flagship 397B MoE scores 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, which would push open coding models closer to closed frontier systems.

NEWS · 1w ago

OpenAI reports beneficial RL improves 44 of 53 evals and transfers beyond health

OpenAI said reinforcement learning on realistic conversations improved 44 of 53 alignment and benefit evaluations, including transfer from health-only training to deception and reward-hacking tests. The result suggests a broader behavioral shift rather than narrow task tuning, but the claim is based on OpenAI’s own eval mix rather than a single public benchmark.

NEWS · 3w ago

Researchers benchmark AutoLab, SkillOpt, and Meta-Agent Challenge for self-improving agents

New papers tested whether agents can improve code, skills, or other agents without heavy human guidance. The results favor persistence, critique, and small targeted edits over one-shot brilliance, but they still show clear limits.

RELEASE · 4w ago

Trajectory launches continual-learning platform with off-policy SDPO

Trajectory launched a platform that turns agent traces and user corrections into post-deployment model updates instead of prompt-only fixes. Baseten and Tinker described live A/B post-training, 397B-model deployment work, and an off-policy recipe for stabilizing the loop.

RELEASE · 1mo ago

Ramp Sheets launches Fast Ask RL subagent with +4% exact-match gain over Opus at Haiku latency

Ramp and Prime Intellect launched Fast Ask, a small RL-trained spreadsheet retrieval subagent for Ramp Sheets. Ramp says it beats Opus by 4% exact match while running at Haiku latency, showing how narrow RL-trained agents can outperform larger frontier models on repetitive enterprise tasks.
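
For readers weighing the exact-match figure: the metric is usually the fraction of answers that equal the reference after light normalization. The sketch below is illustrative only. The source does not describe Ramp's scoring, and the `normalize` helper and its rules are assumptions.

```python
def normalize(text: str) -> str:
    """Hypothetical normalization: trim, lowercase, collapse whitespace."""
    return " ".join(text.strip().lower().split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions equal to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# Two cells retrieved, one matches after normalization.
score = exact_match(["Q3 Revenue", " 1200 "], ["q3 revenue", "1250"])
```

The summary does not say whether the 4% gain is in absolute percentage points or relative to Opus's score, so the two readings should not be treated as interchangeable.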

RELEASE · 1mo ago

Zyphra releases ZAYA1-8B with <1B active params and Markovian RSA reasoning

Zyphra released ZAYA1-8B, an Apache-2.0 reasoning MoE with compressed-convolutional attention and bounded-context Markovian RSA test-time compute. The model targets math and coding workloads while keeping the active parameter count below 1B.

RELEASE · 1mo ago

ml-intern adds YOLO mode and Hub session sync for long-running post-training runs

ml-intern now lets an agent run long post-training tasks like parallel ablations in YOLO mode and automatically pushes session traces to a Hub account for later inspection. That gives RL and fine-tuning workflows both unattended execution and a built-in audit trail.

RELEASE · 1mo ago

Qwen-Scope releases SAE toolkit for Qwen3.5-27B steering

Alibaba’s Qwen team released Qwen-Scope, an open sparse-autoencoder suite for Qwen3.5-27B that can steer outputs, surface repetition features, and compare benchmark feature overlap. The toolkit turns interpretability artifacts into debugging, data-generation, and evaluation workflows.

RELEASE · 3mo ago

Miles adds ROCm support on AMD Instinct and raises AIME to 0.729

Miles added ROCm support for AMD Instinct clusters and reported GRPO post-training gains on Qwen3-30B-A3B, including AIME rising from 0.665 to 0.729 (a gain of 0.064). It matters if you are evaluating rollout-heavy RL jobs off NVIDIA and want concrete throughput and step-time numbers before porting.
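
As background on why GRPO jobs are rollout-heavy: the method samples a group of completions per prompt and scores each one relative to the rest of its group, so it needs no learned value model. The sketch below shows the commonly described group-normalized advantage. It is not taken from Miles, whose implementation may differ, and the function name and `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per completion sampled for a
    single prompt. `eps` (an assumed value) avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, two judged correct (reward 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal, which is one reason group size and rollout throughput matter for these jobs.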

NEWS · 3mo ago

Physical Intelligence introduces RL token for 15-minute robot refinement and 3x speedups

Physical Intelligence says its RL token compresses VLA state into a lightweight signal that an on-robot actor-critic can adapt in minutes. This matters for last-millimeter manipulation, where full-size models are often too slow or too coarse to tune online.

RELEASE · 3mo ago

NVIDIA releases Nemotron-Cascade 2 30B-A3B with IMO gold-level claims and Ollama support

NVIDIA published Nemotron-Cascade 2, a 30B MoE with 3B active parameters, claiming IMO gold-level math and Kimi K2.5-class code scores, then pushed it to Hugging Face and Ollama. It is worth testing if you want an open agent model with immediate local and hosted paths.

NEWS · 3mo ago

Mistral launches Forge for enterprise model training on private data with pretrain and RL

Mistral introduced Forge, a platform for enterprises to pre-train, post-train, and reinforce models on internal code, policies, and operational data, including on-prem deployments. Consider it when retrieval alone is not enough and you need weights tuned to private workflows.

RELEASE · 3mo ago

H Company releases Holotron-12B: 8.9k tok/s on H100 and 80.5% WebVoyager

H Company launched Holotron-12B, an open multimodal model for computer-use agents built on a hybrid SSM-attention stack that targets KV-cache bottlenecks. Benchmark it if you need high-concurrency browser agents and want better throughput without giving up web-task accuracy.

RELEASE · 3mo ago

OpenClaw-RL releases fully asynchronous online training with OPD for live agents

OpenClaw-RL released a fully asynchronous online training stack that turns live interaction feedback into ongoing agent updates with binary rewards and token-level OPD corrections. Use it as a starting point for online agent improvement only if you can score rollouts reliably and manage privacy risk.

NEWS · 3mo ago

UT Austin compares Seq. FT + LoRA vs RL for VLA continual learning

UT Austin researchers report that simple sequential fine-tuning with LoRA and on-policy RL can retain prior skills while learning new VLA tasks. Try this baseline before reaching for more complex continual-learning methods.
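
For context on the LoRA half of that baseline: LoRA freezes the pretrained weight matrix and learns only a low-rank correction added to its output. The dependency-free sketch below illustrates the idea on a single linear layer. It is not the UT Austin setup, and the class name, initialization scale, and `alpha` default are assumptions.

```python
import random


def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight: list[list[float]], rank: int,
                 alpha: float = 1.0, seed: int = 0) -> None:
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen: never updated during tuning
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapted
        # layer initially behaves exactly like the frozen base layer.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x: list[float]) -> list[float]:
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
initial = layer.forward([1.0, 1.0])   # equals the base layer's output
```

Because only `a` and `b` would be trained, each new task touches a small number of parameters, which is one plausible reason the approach tends to preserve earlier skills.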

NEWS · 3mo ago

OpenClaw-RL reports continuous agent training from user corrections and next-state signals

The OpenClaw-RL paper proposes training agents continuously from normal interactions by turning user corrections, logs, and next-state feedback into rewards and token-level supervision. Watch it if you build persistent agents and want adaptation to come from live deployment traces instead of offline labeling.