Breaking · March 21, 2026

Physical Intelligence introduces RL token for 15-minute robot refinement and 3x speedups

Physical Intelligence says its RL token compresses VLA state into a lightweight signal that an on-robot actor-critic can adapt in minutes. This matters for last-millimeter manipulation, where full-size models are often too slow or too coarse to tune online.

TL;DR

Physical Intelligence says its new RL token turns a VLA's internal state into a compact signal that a small on-robot RL policy can use for fast adaptation, instead of retraining the full model (RL token thread).
The company is aiming at the "last millimeter" of manipulation, where its research summary says general-purpose robot models still struggle with precision tasks like alignment and tool use.
In the reported setup, an actor-critic policy learns a residual correction on top of the base model's action, and the thread claims robots become "up to 3× faster" after as little as 15 minutes of real-world practice.
Physical Intelligence's research page also says the lightweight policy can train with off-policy RL at hundreds of updates per second on the robot, which is the key implementation detail behind the short adaptation loop (research summary).

What exactly shipped?

Physical Intelligence is pitching RL token as a narrow interface between a large vision-language-action model and a much smaller reinforcement learning module. In the company's research summary, the VLA's high-dimensional embeddings are compressed through an encoder bottleneck into a low-dimensional token, which is then optimized to preserve task-relevant information.
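
To make the bottleneck idea concrete, here is a minimal sketch of the forward pass only: a single linear projection followed by `tanh` that maps a large embedding to a small token. The dimensions (2048 and 16), the function names, and the random weights are all illustrative assumptions; the source does not describe the encoder's architecture, sizes, or training objective.

```python
import math
import random


def make_encoder(in_dim, token_dim, seed=0):
    """Random linear projection, a hypothetical stand-in for a trained encoder."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)]
            for _ in range(token_dim)]


def encode_rl_token(embedding, weights):
    """Compress a high-dimensional embedding into a low-dimensional token."""
    return [math.tanh(sum(w * x for w, x in zip(row, embedding)))
            for row in weights]


# Illustrative sizes only; the source does not state either dimension.
VLA_DIM, TOKEN_DIM = 2048, 16

weights = make_encoder(VLA_DIM, TOKEN_DIM)
rng = random.Random(1)
embedding = [rng.gauss(0.0, 1.0) for _ in range(VLA_DIM)]  # fake VLA state

rl_token = encode_rl_token(embedding, weights)
print(len(embedding), "->", len(rl_token))  # prints: 2048 -> 16
```

In a real system the projection would be learned so that the token keeps task-relevant information; this sketch only shows the shape of the interface that the downstream RL module sees.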

That token becomes the input to a small actor-critic policy that runs online. According to the thread, the actor takes both the RL token and the base model's proposed action, then learns a residual correction rather than replacing the full policy. The accompanying diagram shows the loop explicitly: rollout on the robot, replay buffer, then repeated actor-critic updates tied back to the tokenized state.
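
The residual structure described above can be sketched in a few lines. This is a hedged illustration, not Physical Intelligence's implementation: the linear actor, the `max_correction` bound, the zero initialization, and the dimensions (a 16-value token and a 7-value action) are all assumptions chosen for the example.

```python
import math
import random


def actor_residual(rl_token, base_action, weights, max_correction=0.05):
    """Map (RL token, base action) to a bounded correction per action dimension."""
    features = list(rl_token) + list(base_action)
    return [max_correction * math.tanh(sum(w * f for w, f in zip(row, features)))
            for row in weights]


def refined_action(rl_token, base_action, weights):
    """Final command = base model's action + small learned correction."""
    delta = actor_residual(rl_token, base_action, weights)
    return [a + d for a, d in zip(base_action, delta)]


TOKEN_DIM, ACTION_DIM = 16, 7  # hypothetical sizes
rl_token = [0.1] * TOKEN_DIM
base_action = [0.2] * ACTION_DIM

# Assumed design choice: zero-initialized actor weights, so the combined
# policy starts out identical to the base model.
zero_weights = [[0.0] * (TOKEN_DIM + ACTION_DIM) for _ in range(ACTION_DIM)]
initial = refined_action(rl_token, base_action, zero_weights)

# After some (here: random, untrained) weight change, the correction stays bounded.
rng = random.Random(0)
rand_weights = [[rng.uniform(-1.0, 1.0) for _ in range(TOKEN_DIM + ACTION_DIM)]
                for _ in range(ACTION_DIM)]
adjusted = refined_action(rl_token, base_action, rand_weights)
```

Bounding the correction is one common way to keep a residual policy from overriding the base controller; whether the reported system does this is not stated in the source.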

Why does this matter for robot deployment?

The practical claim is not broader autonomy but faster refinement at the hardest part of execution. Physical Intelligence's announcement thread frames the failure mode as the "last millimeter," where coarse model outputs are good enough to reach the object but not to complete delicate alignment, insertion, or tool-use steps.

The reported advantage is sample efficiency. The research page says the small RL module can be trained directly on the robot with off-policy updates at hundreds of updates per second, and the thread says that was enough to improve behavior after as little as 15 minutes of practice. The same sources claim up to 3× faster task execution, fewer mistakes, and in some cases execution faster than human teleoperation.
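
The reason off-policy training helps here is that each real robot step can be reused for many updates through a replay buffer. The sketch below illustrates that pattern with a toy linear Q critic and several updates per environment step. Everything in it is an assumption made for illustration: the fake environment, the reward, the placeholder actor, the learning rate, and the `updates_per_step` value. Only the critic update is shown; the actor step is omitted.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.storage)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        k = min(batch_size, len(self.storage))
        return self.rng.sample(list(self.storage), k)


def q_value(w, state, action):
    return sum(wi * xi for wi, xi in zip(w, list(state) + list(action)))


def critic_update(w, batch, policy, lr=0.01, gamma=0.95):
    """One semi-gradient TD pass over a batch for a linear Q critic."""
    for state, action, reward, next_state, done in batch:
        next_q = 0.0 if done else q_value(w, next_state, policy(next_state))
        td_error = reward + gamma * next_q - q_value(w, state, action)
        for i, x in enumerate(list(state) + list(action)):
            w[i] += lr * td_error * x
    return w


def train(num_env_steps=50, updates_per_step=4, batch_size=8, seed=0):
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity=1000, seed=seed)
    w = [0.0] * 3                  # 2 state features + 1 action feature
    policy = lambda s: [0.0]       # placeholder actor (not trained here)
    total_updates = 0
    for _ in range(num_env_steps):
        # Fake transition; actions come from random exploration, not from
        # `policy`, which is what makes the stored data off-policy.
        state = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        action = [rng.uniform(-1, 1)]
        reward = -abs(action[0] - state[0])
        next_state = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        buffer.add((state, action, reward, next_state, False))
        for _ in range(updates_per_step):  # reuse old data many times
            w = critic_update(w, buffer.sample(batch_size), policy)
            total_updates += 1
    return w, total_updates, len(buffer)


critic_weights, total_updates, buffer_size = train()
print(total_updates, buffer_size)  # prints: 200 50
```

With 50 simulated robot steps and 4 updates per step, the critic receives 200 updates from only 50 transitions. Scaling the same ratio up is how a short session of real practice can still yield a large number of learning updates.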

For engineers, the architectural point is clear: keep the large pretrained VLA as the general controller, then add a compact, fast-learning adapter for precision corrections. That avoids full-model online tuning while still giving the system a way to specialize during deployment (research summary).