Together Research released Aurora, an open-source speculative decoding system that retrains the draft model continuously from live accept/reject signals. The work claims higher accepted lengths and throughput as traffic shifts, without requiring a fixed pretrained speculator.

Aurora is Together's open-source take on online speculative decoding: instead of treating the draft model as a fixed artifact, it keeps adapting during serving. The core claim in the launch thread is "1.25x faster vs. a well-trained static speculator" without an offline retraining pipeline, and the GitHub repo plus the paper are already public.
The system is positioned around a specific production problem. As the traffic shift thread puts it, static draft models "degrade as traffic patterns shift," so a speculator tuned on code generation can be wrong-footed once demand moves toward reasoning or conversation. That makes Aurora less about peak benchmark speed in a frozen setting and more about holding onto speculative decoding gains when the live request distribution keeps moving.
Together describes Aurora as a serve-to-train loop. The inference server emits per-request accept/reject outcomes, those traces go to a buffer, and an asynchronous training server updates the draft model before hot-swapping weights back into service, according to the architecture post. The practical angle is continuous updates without taking the serving path down.
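The serve-to-train loop can be sketched in miniature. This is a toy illustration of the assumed structure, not Aurora's implementation: a bigram-count table stands in for the draft model, a queue stands in for the trace buffer, and an atomic reference swap stands in for the weight hot-swap. All class and method names here are hypothetical.

```python
import queue
import threading


class DraftModel:
    """Toy speculator: bigram counts predicting the next token.
    Stands in for a small neural draft model."""

    def __init__(self):
        self.counts = {}  # prev_token -> {next_token: count}

    def predict(self, prev_token):
        candidates = self.counts.get(prev_token)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)


class ServeTrainLoop:
    """Sketch of a serve-to-train flywheel: serving emits accept/reject
    traces into a buffer; an async trainer consumes them and hot-swaps
    an updated draft model back into the serving path."""

    def __init__(self):
        self.buffer = queue.Queue()  # trace buffer between serve and train
        self.model = DraftModel()    # live draft model used by serving
        self._lock = threading.Lock()

    def record_trace(self, prev_token, target_token, accepted):
        # Inference-server side: every verified draft position yields a
        # trace of what the target model actually emitted there.
        self.buffer.put((prev_token, target_token, accepted))

    def train_step(self, batch_size=32):
        # Training-server side: drain a batch of traces into a fresh copy
        # of the model, then swap the reference. Serving never stops; it
        # just sees the new model on its next predict() call.
        new_model = DraftModel()
        new_model.counts = {k: dict(v) for k, v in self.model.counts.items()}
        trained = 0
        while trained < batch_size and not self.buffer.empty():
            prev_tok, target_tok, accepted = self.buffer.get()
            # Count the target's token regardless of the accepted flag:
            # rejections are the corrective signal, acceptances reinforce.
            slot = new_model.counts.setdefault(prev_tok, {})
            slot[target_tok] = slot.get(target_tok, 0) + 1
            trained += 1
        with self._lock:
            self.model = new_model  # atomic reference swap = hot-swap
        return trained
```

The key design point the sketch preserves is asynchrony: serving only ever enqueues traces and reads whatever model reference is current, so training latency never blocks the request path.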
The headline benchmark from Together is that online training from scratch surpassed both static alternatives. In the results post, Aurora reached an accepted length of 3.08 at 302.3 tok/s, versus accepted lengths of 2.63 for a static pretrained baseline and 2.99 for a pretrained-plus-finetuned setup. That is the sharper claim here: not just that Aurora adapts online, but that Together argues offline pretraining is "not a prerequisite" for an effective speculative decoder. For implementers, that shifts the optimization target from periodically rebuilding a draft model offline to wiring training and serving into one loop, as described in Together's blog.
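A back-of-envelope way to read the accepted-length numbers, under assumptions that are mine rather than Together's: in standard speculative decoding each target-model verification pass emits the accepted draft tokens plus one bonus token sampled from the target itself, and draft-model overhead is ignored.

```python
def tokens_per_target_pass(accepted_len, bonus_token=True):
    """Average tokens emitted per target-model forward pass, assuming
    the standard accounting: accepted drafts + one bonus target token."""
    return accepted_len + (1 if bonus_token else 0)


def relative_speedup(accepted_a, accepted_b):
    """Crude speedup of speculator A over B from accepted lengths alone,
    ignoring draft-model cost and batching effects."""
    return tokens_per_target_pass(accepted_a) / tokens_per_target_pass(accepted_b)


# Accepted lengths from Together's results post:
print(round(relative_speedup(3.08, 2.63), 3))  # → 1.124
```

This crude ratio will not exactly match measured throughput gains such as the quoted 1.25x, since real speedup also depends on draft cost, speculation depth, and serving-side effects; it just shows the direction and rough scale implied by the accepted-length gap.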
New from Together Research: Aurora. Speculative decoding that adapts to shifting traffic in real time — and keeps improving the longer it runs. Open-source, RL-based, 1.25x faster vs. a well-trained static speculator with no offline retraining pipeline. Thread 🧵
The design: a serve-to-train flywheel. The inference server streams accept/reject results from every request to a buffer. An async training server updates the draft model and hot-swaps weights back — zero service interruption.
The headline finding — online training from scratch surpasses a carefully pretrained static baseline:
→ Aurora: 3.08 accepted length, 302.3 tok/s
→ Static pretrained: 2.63
→ Pretrained + finetuned: 2.99
Offline pretraining is not a prerequisite for effective speculative decoding.