Reinforcement Learning
RL, RFT, and environment-driven training for agent behavior.
Stories
DeepReinforce released Ornith-1.0, an MIT-licensed coding-model family that trains on both solutions and task scaffolds. The flagship 397B MoE claims 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, pushing open coding models closer to closed frontier systems.
OpenAI said reinforcement learning on realistic conversations improved 44 of 53 alignment and benefit evaluations, including transfer from health-only training to deception and reward-hacking tests. The result suggests a broader behavioral shift rather than narrow task tuning, but the claim is based on OpenAI’s own eval mix rather than a single public benchmark.
New papers tested whether agents can improve code, skills, or other agents without heavy human guidance. The results favor persistence, critique, and small targeted edits over one-shot brilliance, but they still show clear limits.
Trajectory launched a platform that turns agent traces and user corrections into post-deployment model updates instead of prompt-only fixes. Baseten and Tinker described live A/B post-training, 397B-model deployment work, and an off-policy recipe for stabilizing the loop.
Ramp and Prime Intellect launched Fast Ask, a small RL-trained spreadsheet retrieval subagent for Ramp Sheets. Ramp says it beats Opus by 4% exact match while running at Haiku latency, showing how narrow RL-trained agents can outperform larger frontier models on repetitive enterprise tasks.
Zyphra released ZAYA1-8B, an Apache-2.0 reasoning MoE with compressed-convolutional attention and bounded-context Markovian RSA test-time compute. The model targets math and coding workloads while keeping the active parameter count below 1B.
ml-intern now lets an agent run long post-training tasks like parallel ablations in YOLO mode and automatically pushes session traces to a Hub account for later inspection. That gives RL and fine-tuning workflows both unattended execution and a built-in audit trail.
Alibaba’s Qwen team released Qwen-Scope, an open sparse-autoencoder suite for Qwen3.5-27B that can steer outputs, surface repetition features, and compare benchmark feature overlap. The toolkit turns interpretability artifacts into debugging, data-generation, and evaluation workflows.
Miles added ROCm support for AMD Instinct clusters and reported GRPO post-training gains on Qwen3-30B-A3B, including AIME rising from 0.665 to 0.729 (+0.064). It matters if you are evaluating rollout-heavy RL jobs on non-NVIDIA hardware and want concrete throughput and step-time numbers before porting.
Physical Intelligence says its RL token compresses VLA state into a lightweight signal that an on-robot actor-critic can adapt in minutes. This matters for last-millimeter manipulation, where full-size models are often too slow or too coarse to tune online.
NVIDIA published Nemotron-Cascade 2, a 30B MoE with 3B active parameters, claiming IMO gold-level math and Kimi K2.5-class code scores, then pushed it to Hugging Face and Ollama. It is worth testing if you want an open agent model with immediate local and hosted paths.
Mistral introduced Forge, a platform for enterprises to pre-train, post-train, and reinforce models on internal code, policies, and operational data, including on-prem deployments. Consider it when retrieval alone is not enough and you need weights tuned to private workflows.
H Company launched Holotron-12B, an open multimodal model for computer-use agents built on a hybrid SSM-attention stack that targets KV-cache bottlenecks. Benchmark it if you need high-concurrency browser agents and want better throughput without giving up web-task accuracy.
OpenClaw-RL released a fully asynchronous online training stack that turns live interaction feedback into ongoing agent updates with binary rewards and token-level OPD corrections. Use it as a starting point for online agent improvement only if you can score rollouts reliably and manage privacy risk.
UT Austin researchers report that simple sequential fine-tuning with LoRA and on-policy RL can retain prior skills while learning new VLA tasks. Try this baseline before reaching for more complex continual-learning methods.
The OpenClaw-RL paper proposes training agents continuously from normal interactions by turning user corrections, logs, and next-state feedback into rewards and word-level supervision. Watch it if you build persistent agents and want adaptation to come from live deployment traces instead of offline labeling.
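Several of the stories above (the GRPO runs reported by Miles, the binary rewards in OpenClaw-RL) depend on scoring groups of rollouts against each other. As a rough illustration only, and not taken from any of these projects, the group-relative advantage step commonly associated with GRPO can be sketched in plain Python. The function name, the epsilon value, and the sample rewards are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts for the same prompt.

    Rollouts that beat the group average get a positive advantage,
    those below it get a negative one. `eps` avoids division by zero
    when every rollout in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one prompt, scored with a binary reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

When every rollout in a group gets the same reward, all advantages come out as zero, so that group contributes no learning signal. This is one reason reliable, discriminating rollout scoring matters for these pipelines.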