RL, RFT, and environment-driven training for agent behavior.
Miles added ROCm support for AMD Instinct clusters and reported GRPO post-training gains on Qwen3-30B-A3B, including AIME rising from 0.665 to 0.729. It matters if you are evaluating rollout-heavy RL jobs off NVIDIA and want concrete throughput and step-time numbers before porting.
Physical Intelligence says its RL token compresses VLA state into a lightweight signal that an on-robot actor-critic can adapt in minutes. This matters for last-millimeter manipulation, where full-size models are often too slow or too coarse to tune online.
NVIDIA published Nemotron-Cascade 2, a 30B MoE with 3B active parameters, claiming IMO gold-level math and Kimi K2.5-class code scores, then pushed it to Hugging Face and Ollama. It is worth testing if you want an open agent model with immediate local and hosted paths.
Mistral introduced Forge, a platform for enterprises to pre-train, post-train, and reinforce models on internal code, policies, and operational data, including on-prem deployments. Consider it when retrieval alone is not enough and you need weights tuned to private workflows.
H Company launched Holotron-12B, an open multimodal model for computer-use agents built on a hybrid SSM-attention stack that targets KV-cache bottlenecks. Benchmark it if you need high-concurrency browser agents and want better throughput without giving up web-task accuracy.
OpenClaw-RL released a fully asynchronous online training stack that turns live interaction feedback into ongoing agent updates with binary rewards and token-level OPD corrections. Use it as a starting point for online agent improvement only if you can score rollouts reliably and manage privacy risk.
UT Austin researchers report that simple sequential fine-tuning with LoRA and on-policy RL can retain prior skills while learning new VLA tasks. Try this baseline before reaching for more complex continual-learning methods.
The OpenClaw-RL paper proposes training agents continuously from normal interactions by turning user corrections, logs, and next-state feedback into rewards and word-level supervision. Watch it if you build persistent agents and want adaptation to come from live deployment traces instead of offline labeling.