Together Research released Aurora, an open-source speculative decoding system that retrains the draft model continuously from live accept/reject signals. The work claims higher accepted lengths and throughput as traffic shifts, without requiring a fixed pretrained speculator.

Aurora is Together's open-source take on online speculative decoding: instead of treating the draft model as a fixed artifact, it keeps adapting during serving. The core claim in the launch thread is "1.25x faster vs. a well-trained static speculator" without an offline retraining pipeline, and the GitHub repo plus the paper are already public.
The system is positioned around a specific production problem. As the traffic shift thread puts it, static draft models "degrade as traffic patterns shift," so a speculator tuned on code generation can be wrong-footed once demand moves toward reasoning or conversation. That makes Aurora less about peak benchmark speed in a frozen setting and more about holding onto speculative decoding gains when the live request distribution keeps moving.
Together describes Aurora as a serve-to-train loop. The inference server emits per-request accept/reject outcomes, those traces go to a buffer, and an asynchronous training server updates the draft model before hot-swapping weights back into service, according to the architecture post. The practical angle is continuous updates without taking the serving path down.
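The serve-to-train loop can be sketched in miniature. This is a toy illustration of the assumed structure, not Aurora's implementation: a bigram-count table stands in for the draft model, a queue stands in for the trace buffer, and an atomic reference swap stands in for the weight hot-swap. All class and method names here are hypothetical.

```python
import queue
import threading


class DraftModel:
    """Toy speculator: bigram counts predicting the next token.
    Stands in for a small neural draft model."""

    def __init__(self):
        self.counts = {}  # prev_token -> {next_token: count}

    def predict(self, prev_token):
        candidates = self.counts.get(prev_token)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)


class ServeTrainLoop:
    """Sketch of a serve-to-train flywheel: serving emits accept/reject
    traces into a buffer; an async trainer consumes them and hot-swaps
    an updated draft model back into the serving path."""

    def __init__(self):
        self.buffer = queue.Queue()  # trace buffer between serve and train
        self.model = DraftModel()    # live draft model used by serving
        self._lock = threading.Lock()

    def record_trace(self, prev_token, target_token, accepted):
        # Inference-server side: every verified draft position yields a
        # trace of what the target model actually emitted there.
        self.buffer.put((prev_token, target_token, accepted))

    def train_step(self, batch_size=32):
        # Training-server side: drain a batch of traces into a fresh copy
        # of the model, then swap the reference. Serving never stops; it
        # just sees the new model on its next predict() call.
        new_model = DraftModel()
        new_model.counts = {k: dict(v) for k, v in self.model.counts.items()}
        trained = 0
        while trained < batch_size and not self.buffer.empty():
            prev_tok, target_tok, accepted = self.buffer.get()
            # Count the target's token regardless of the accepted flag:
            # rejections are the corrective signal, acceptances reinforce.
            slot = new_model.counts.setdefault(prev_tok, {})
            slot[target_tok] = slot.get(target_tok, 0) + 1
            trained += 1
        with self._lock:
            self.model = new_model  # atomic reference swap = hot-swap
        return trained
```

The key design point the sketch preserves is asynchrony: serving only ever enqueues traces and reads whatever model reference is current, so training latency never blocks the request path.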
The headline benchmark from Together is that online training from scratch surpassed both static alternatives. In the results post, Aurora reached an accepted length of 3.08 at 302.3 tok/s, versus accepted lengths of 2.63 for a static pretrained baseline and 2.99 for a pretrained-plus-finetuned setup. That is the sharper claim here: not just that Aurora adapts online, but that Together argues offline pretraining is "not a prerequisite" for an effective speculative decoder. For implementers, that shifts the optimization target from periodically rebuilding a draft model offline to wiring training and serving into one loop, as described in Together's blog.
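A back-of-envelope way to read the accepted-length numbers, under assumptions that are mine rather than Together's: in standard speculative decoding each target-model verification pass emits the accepted draft tokens plus one bonus token sampled from the target itself, and draft-model overhead is ignored.

```python
def tokens_per_target_pass(accepted_len, bonus_token=True):
    """Average tokens emitted per target-model forward pass, assuming
    the standard accounting: accepted drafts + one bonus target token."""
    return accepted_len + (1 if bonus_token else 0)


def relative_speedup(accepted_a, accepted_b):
    """Crude speedup of speculator A over B from accepted lengths alone,
    ignoring draft-model cost and batching effects."""
    return tokens_per_target_pass(accepted_a) / tokens_per_target_pass(accepted_b)


# Accepted lengths from Together's results post:
print(round(relative_speedup(3.08, 2.63), 3))  # → 1.124
```

This crude ratio will not exactly match measured throughput gains such as the quoted 1.25x, since real speedup also depends on draft cost, speculation depth, and serving-side effects; it just shows the direction and rough scale implied by the accepted-length gap.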
New from Together Research: Aurora. Speculative decoding that adapts to shifting traffic in real time — and keeps improving the longer it runs. Open-source, RL-based, 1.25x faster vs. a well-trained static speculator with no offline retraining pipeline. Thread 🧵
The design: a serve-to-train flywheel. The inference server streams accept/reject results from every request to a buffer. An async training server updates the draft model and hot-swaps weights back — zero service interruption.
The headline finding — online training from scratch surpasses a carefully pretrained static baseline:
→ Aurora: 3.08 accepted length, 302.3 tok/s
→ Static pretrained: 2.63
→ Pretrained + finetuned: 2.99
Offline pretraining is not a prerequisite for effective speculative decoding.