AI Primer

Parcae claims 1.3B Transformer quality from a 770M looped model

Together AI and UCSD released Parcae, a looped model that reuses layers through a constrained recurrent update and reports stronger results than parameter-matched Transformers from 140M to 1.3B scales. The released models and code suggest that recurrence can trade extra compute for quality at fixed parameter memory, rather than relying on scaling parameters alone.


TL;DR

You can jump straight to the official blog post, read the paper, and browse the released training code and model collection. The interesting bit is not parameter compression; it is a training recipe that finally makes layer looping behave at Transformer-like learning rates. The blog also breaks the model into prelude, recurrent, and coda blocks, which makes the recurrence trick much easier to reason about than the usual vague "reuses layers" shorthand.

Stability

Looped models have been attractive for years because they increase compute without increasing parameter memory, but Together's thread on prior failures says the usual recipes diverged at learning rate 4e-4 in setups where standard Transformers trained at 1e-3.

The blog and paper frame the fix as a dynamical-systems problem. In the official blog post, the recurrent update is analyzed as a discrete linear time-invariant (LTI) system over the residual stream, and Together's summary says Parcae keeps that system stable by learning a negative diagonal parameterization that constrains the spectral radius below 1.
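A minimal numpy sketch of why a spectral radius below 1 matters, assuming the recurrent update behaves like a diagonal LTI system `h ← λ·h + x` with each diagonal entry parameterized as `-sigmoid(θ_i)` (an illustrative choice; the paper's exact parameterization may differ):

```python
import numpy as np

# Hypothetical parameterization: each diagonal entry is -sigmoid(theta_i),
# so it lies in (-1, 0) and the spectral radius is strictly below 1.
theta = np.linspace(-2.0, 2.0, 16)
lam = -1.0 / (1.0 + np.exp(-theta))

# Discrete LTI update over the residual stream: h <- lam * h + x,
# reinjecting the same input x on every loop.
rng = np.random.default_rng(0)
x = rng.normal(size=16)
h = np.zeros(16)
for _ in range(200):
    h = lam * h + x

# With spectral radius < 1, the iteration converges to x / (1 - lam)
# instead of diverging -- the stability property being claimed.
print(np.max(np.abs(lam)) < 1.0)   # True: spectral radius below 1
print(np.allclose(h, x / (1.0 - lam)))  # True: the loop has converged
```

If any `|λ_i|` exceeded 1, the same loop would blow up, which matches the reported failure mode of earlier looping recipes at higher learning rates.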

Architecture

The official blog post splits a looped Transformer into three parts:

  1. Prelude: turns tokens into a latent state.
  2. Recurrent block: applies the same block for T loops, while reinjecting the input state each time.
  3. Coda: converts the final hidden state into logits.

That structure matters because the compute increase comes from extra passes through the recurrent block, not from adding more unique layers. Together's edge-inference note makes the pitch plainly: when memory is the real limit, looping offers another axis for quality besides parameter count.
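The three-part structure can be sketched schematically; here single linear maps with a tanh stand in for full Transformer blocks, and all names and shapes are illustrative, not the released architecture:

```python
import numpy as np

def looped_forward(tokens, W_prelude, W_rec, W_coda, T=4):
    """Schematic looped-model forward pass (illustrative, not the real model)."""
    s = np.tanh(tokens @ W_prelude)     # prelude: tokens -> latent state
    h = np.zeros_like(s)
    for _ in range(T):                  # recurrent block: same weights, T loops
        h = np.tanh((h + s) @ W_rec)    # reinject the prelude state each pass
    return h @ W_coda                   # coda: final hidden state -> logits

rng = np.random.default_rng(1)
d_in, d_model, vocab = 8, 16, 32
tokens = rng.normal(size=(4, d_in))
logits = looped_forward(
    tokens,
    rng.normal(size=(d_in, d_model)) * 0.1,
    rng.normal(size=(d_model, d_model)) * 0.1,
    rng.normal(size=(d_model, vocab)) * 0.1,
    T=8,  # more loops = more compute, same parameter count
)
print(logits.shape)  # (4, 32)
```

The key property is visible in the loop: raising `T` increases compute through `W_rec` without adding a single new weight matrix.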

Results

Together says Parcae beat every parameter- and data-matched Transformer they trained at 140M, 370M, 770M, and 1.3B scales. The clearest number in the release is the 370M comparison, where Parcae scores 20.00 on Core versus 17.46 for the baseline, a 14.5 percent gain.
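The quoted gain checks out arithmetically:

```python
core_parcae, core_baseline = 20.00, 17.46
gain_pct = (core_parcae - core_baseline) / core_baseline * 100
print(round(gain_pct, 1))  # 14.5
```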

The same post says Parcae also improved on earlier looped-model recipes, with up to 6.3 percent lower validation perplexity. In the official blog post, the larger claim is that a 770M Parcae lands near the downstream quality of a 1.3B fixed-depth Transformer trained on the same data.

Scaling laws

The paper does more than announce a stable recipe. Together's scaling-law thread says Parcae establishes scaling laws for looping itself, with recurrence and data needing to scale together rather than independently.

The concrete example in the release thread is at 128×10^18 FLOPs on the 370M setup, where looped Parcae reaches 20.1 versus 18.1 for fixed depth. The paper frames test-time looping as a separate compute knob too, with gains that taper rather than growing linearly, which is a more usable story than "just loop more".
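The compute-versus-parameter split behind both knobs can be made concrete with a rough accounting sketch; all layer counts and per-layer costs here are made up for illustration, not taken from the paper:

```python
def looped_budget(prelude_layers, rec_layers, coda_layers, loops,
                  params_per_layer=50e6, flops_per_layer=1e9):
    """Rough accounting: loops multiply compute for the recurrent block only,
    while parameter count is fixed by the number of unique layers."""
    unique_layers = prelude_layers + rec_layers + coda_layers
    params = unique_layers * params_per_layer
    flops = (prelude_layers + loops * rec_layers + coda_layers) * flops_per_layer
    return params, flops

# Doubling the loop count raises compute without adding a single parameter.
p4, f4 = looped_budget(2, 4, 2, loops=4)
p8, f8 = looped_budget(2, 4, 2, loops=8)
print(p4 == p8)   # True: same parameter memory
print(f8 / f4)    # 1.8: compute grows with loops
```

This is why recurrence and data have to be scaled jointly: loops change the FLOP budget per token but not the model's capacity in parameters.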

Code and models

The release is unusually complete for a research model. Together's final thread post says the training code and models are already public, and the GitHub repository plus the Hugging Face collection list 140M, 370M, 770M, and 1.3B variants.

That also clarifies provenance. The paper lists Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y. Fu as authors, while Together's thread says Fu led the work and Together AI supplied compute.

🧾 More sources

Stability (1 tweet)
Why prior looped models failed and how Parcae constrains the recurrent dynamics.
Architecture (1 tweet)
How the model is organized into reusable blocks and why looping changes the memory-quality tradeoff.