AI Primer
TOPIC · 12 stories

Reinforcement Learning

RL, RFT, and environment-driven training for agent behavior.

RELEASE · 7th May
Ramp Sheets launches Fast Ask RL subagent with +4% exact-match gain over Opus at Haiku latency

Ramp and Prime Intellect launched Fast Ask, a small RL-trained spreadsheet retrieval subagent for Ramp Sheets. Ramp says it beats Opus by 4% exact match while running at Haiku latency, showing how narrow RL-trained agents can outperform larger frontier models on repetitive enterprise tasks.

RELEASE · 1w ago
Zyphra releases ZAYA1-8B with <1B active params and Markovian RSA reasoning

Zyphra released ZAYA1-8B, an Apache-2.0 reasoning MoE with compressed-convolutional attention and bounded-context Markovian RSA test-time compute. The model targets math and coding workloads while keeping the active parameter count below 1B.

RELEASE · 1w ago
ml-intern adds YOLO mode and Hub session sync for long-running post-training runs

ml-intern now lets an agent run long post-training tasks like parallel ablations in YOLO mode and automatically pushes session traces to a Hub account for later inspection. That gives RL and fine-tuning workflows both unattended execution and a built-in audit trail.

RELEASE · 2w ago
Qwen-Scope releases SAE toolkit for Qwen3.5-27B steering

Alibaba’s Qwen team released Qwen-Scope, an open sparse-autoencoder suite for Qwen3.5-27B that can steer outputs, surface repetition features, and compare benchmark feature overlap. The toolkit turns interpretability artifacts into debugging, data-generation, and evaluation workflows.
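Qwen-Scope's own API is not shown here, but the standard SAE steering intervention it describes can be sketched generically: add a scaled copy of a feature's decoder direction to the residual stream. The toy weights, dimensions, and `steer` helper below are all illustrative assumptions, not the toolkit's actual interface.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 256

# Toy SAE weights; a real suite like Qwen-Scope ships trained encoders/decoders.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))

def steer(resid, feature_idx, alpha):
    """Add alpha times a feature's unit decoder direction to the
    residual-stream activation (the usual SAE steering intervention)."""
    direction = W_dec[feature_idx]
    return resid + alpha * direction / (np.linalg.norm(direction) + 1e-8)

x = rng.normal(size=d_model)
x_steered = steer(x, feature_idx=7, alpha=4.0)
```

Positive `alpha` pushes the output toward the feature (e.g. amplifying a repetition feature to reproduce a failure mode); negative `alpha` suppresses it.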

RELEASE · 1mo ago
Miles adds ROCm support on AMD Instinct and raises AIME to 0.729

Miles added ROCm support for AMD Instinct clusters and reported GRPO post-training gains on Qwen3-30B-A3B, including AIME rising from 0.665 to 0.729. This matters if you are evaluating moving rollout-heavy RL jobs off NVIDIA hardware and want concrete throughput and step-time numbers before porting.
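The GRPO recipe behind numbers like these replaces a learned critic with group-relative advantages: sample several rollouts per prompt, then normalize each rollout's reward by the group mean and standard deviation. A minimal sketch of that core step (the function name and example rewards are illustrative, not from Miles):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize one prompt's rollout rewards
    by the group mean and std. This is the critic-free core of GRPO."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four rollouts for one prompt, scored 0/1 (e.g. exact-match on a math answer)
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Each advantage then weights that rollout's token log-probabilities in a clipped policy-gradient loss; rollouts that beat their group get pushed up, the rest get pushed down.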

NEWS · 1mo ago
Physical Intelligence introduces RL token for 15-minute robot refinement and 3x speedups

Physical Intelligence says its RL token compresses VLA state into a lightweight signal that an on-robot actor-critic can adapt in minutes. This matters for last-millimeter manipulation, where full-size models are often too slow or too coarse to tune online.

RELEASE · 1mo ago
NVIDIA releases Nemotron-Cascade 2 30B-A3 with IMO gold-level claims and Ollama support

NVIDIA published Nemotron-Cascade 2, a 30B MoE with 3B active parameters, claiming IMO gold-level math and Kimi K2.5-class code scores, then pushed it to Hugging Face and Ollama. It is worth testing if you want an open agent model with immediate local and hosted paths.

NEWS · 1mo ago
Mistral launches Forge for enterprise model training on private data with pretrain and RL

Mistral introduced Forge, a platform for enterprises to pre-train, post-train, and reinforce models on internal code, policies, and operational data, including on-prem deployments. Consider it when retrieval alone is not enough and you need weights tuned to private workflows.

RELEASE · 1mo ago
H Company releases Holotron-12B: 8.9k tok/s on H100 and 80.5% WebVoyager

H Company launched Holotron-12B, an open multimodal model for computer-use agents built on a hybrid SSM-attention stack that targets KV-cache bottlenecks. Benchmark it if you need high-concurrency browser agents and want better throughput without giving up web-task accuracy.

RELEASE · 2mo ago
OpenClaw-RL releases fully asynchronous online training with OPD for live agents

OpenClaw-RL released a fully asynchronous online training stack that turns live interaction feedback into ongoing agent updates with binary rewards and token-level OPD corrections. Use it as a starting point for online agent improvement only if you can score rollouts reliably and manage privacy risk.
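OpenClaw-RL's token-level OPD corrections are specific to its stack and not modeled here, but the binary-reward half of such an update can be sketched as a REINFORCE-style loss: every token's log-probability in a rollout is scaled by the reward minus a baseline. The function name and numbers below are illustrative assumptions.

```python
import numpy as np

def sequence_pg_loss(logprobs, reward, baseline=0.5):
    """REINFORCE-style loss for one rollout with a binary reward:
    scale the summed token log-probs by (reward - baseline).
    Minimizing this raises log-probs of rewarded rollouts."""
    return -(reward - baseline) * float(np.sum(logprobs))

token_logprobs = np.array([-0.2, -0.1, -0.3])
good = sequence_pg_loss(token_logprobs, reward=1.0)  # succeeded rollout
bad = sequence_pg_loss(token_logprobs, reward=0.0)   # failed rollout
```

With binary rewards the whole scheme stands or falls on the scorer, which is why the blurb's caveat about reliable rollout scoring is the real constraint.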

NEWS · 2mo ago
UT Austin compares Seq. FT + LoRA vs RL for VLA continual learning

UT Austin researchers report that both simple sequential fine-tuning with LoRA and on-policy RL can retain prior skills while learning new VLA tasks. Try these baselines before reaching for more complex continual-learning methods.
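One reason LoRA suits this baseline: the base weight stays frozen and only a low-rank update is trained per task, which limits interference with earlier skills. A minimal numpy sketch of the layer (the class and toy gradient step are illustrative, not the paper's code):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight plus a low-rank update: y = x @ (W + B @ A).
    Only A and B are trained during sequential fine-tuning; W is fixed."""
    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.02, size=(d_in, d_out))  # frozen
        self.A = rng.normal(scale=0.02, size=(rank, d_out))  # trainable
        self.B = np.zeros((d_in, rank))  # trainable, zero-init => no-op at start

    def __call__(self, x):
        return x @ self.W + x @ self.B @ self.A

layer = LoRALinear(16, 8)
x = np.ones((2, 16))
y0 = layer(x)    # B is zero, so this equals x @ W exactly
layer.B += 0.01  # stand-in for a gradient step on the adapter
y1 = layer(x)
```

Because B is zero-initialized, the adapted model starts identical to the base model, and each task's adapter can be kept or merged separately.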

NEWS · 2mo ago
OpenClaw-RL reports continuous agent training from user corrections and next-state signals

The OpenClaw-RL paper proposes training agents continuously from normal interactions by turning user corrections, logs, and next-state feedback into rewards and token-level supervision. Read it if you build persistent agents and want adaptation to come from live deployment traces instead of offline labeling.

AI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.