Parcae claims 1.3B Transformer quality from a 770M looped model
Together AI and UCSD released Parcae, a looped model that reuses its layers under constrained recurrent dynamics and reports stronger results than parameter-matched Transformers from 140M to 1.3B scales. The released models and code suggest that, under a fixed FLOP budget, recurrence can trade memory for quality instead of scaling parameters alone.

TL;DR
- Together AI's launch thread says Parcae is a looped architecture that reuses the same layers multiple times, and a 770M version reaches the quality of a 1.3B Transformer trained on the same data.
- According to Together's stability explainer and the official paper, Parcae stabilizes recurrence by constraining the recurrent injection dynamics so the spectral radius stays below 1.
- Together's results snapshot reports wins over parameter- and data-matched Transformers from 140M through 1.3B, including a 370M Core score of 20.00 versus 17.46 for the baseline.
- Together's scaling-law note says looping and training data follow joint power laws, so extra recurrence only pays off when data scales with it under a fixed FLOP budget.
- Together's release post links live artifacts, including the blog post, GitHub repo, and Hugging Face collection.

You can jump straight to the official blog post, read the paper, and browse the released training code and model collection. The interesting bit is not parameter compression; it is a training recipe that finally makes layer looping train stably at Transformer-like learning rates. The blog also breaks the model into prelude, recurrent, and coda blocks, which makes the recurrence trick much easier to reason about than the usual vague "reuses layers" shorthand.
Stability
Looped models have been attractive for years because they increase compute without increasing parameter memory, but Together's thread on prior failures says the usual recipes diverged at a learning rate of 4e-4 in setups where standard Transformers trained stably at 1e-3.
The blog and paper frame the fix as a dynamical-systems problem. In the official blog post, the recurrent update is analyzed as a discrete LTI system over the residual stream, and Together's summary says Parcae keeps that system stable by learning a negative diagonal parameterization that constrains the spectral radius below 1.
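The release does not publish the exact parameterization, but the stability idea can be sketched with a toy diagonal recurrence: if every diagonal entry is produced by a map that forces it into (0, 1), the spectral radius stays below 1 no matter what values are learned, and the recurrence converges instead of diverging. Everything below is a hypothetical illustration, not Parcae's code:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

# Toy stand-in for a negative diagonal parameterization:
# h_{t+1} = A h_t + x, with A = diag(exp(-softplus(a))).
# Since softplus(a) > 0, each diagonal entry lies in (0, 1),
# so the spectral radius of A is below 1 for ANY learned vector a.
rng = np.random.default_rng(0)
a = rng.normal(size=8)                # unconstrained learned parameters
A = np.diag(np.exp(-softplus(a)))     # contractive by construction

assert np.max(np.abs(np.linalg.eigvals(A))) < 1.0

# A contractive recurrence settles to a fixed point instead of blowing up.
x = rng.normal(size=8)                # reinjected input state
h = np.zeros(8)
for _ in range(100):
    h = A @ h + x
# The fixed point solves h* = A h* + x, i.e. h* = (I - A)^{-1} x.
h_star = np.linalg.solve(np.eye(8) - A, x)
print(np.allclose(h, h_star))         # True: the loop converged
```

Without the constraint, any eigenvalue above 1 would let activations grow geometrically with loop count, which matches the divergence story in the thread above.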
Architecture
The official blog post splits a looped Transformer into three parts:
- Prelude: turns tokens into a latent state.
- Recurrent block: applies the same block for T loops, while reinjecting the input state each time.
- Coda: converts the final hidden state into logits.
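The three-part flow above can be sketched in a few lines, with the Transformer blocks replaced by plain linear maps purely for illustration (the module names and shapes are assumptions, not Parcae's):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden width

# Hypothetical stand-ins for the three parts; real Parcae blocks are
# Transformer layers, reduced here to fixed linear maps.
W_prelude = rng.normal(size=(D, D)) / np.sqrt(D)
W_loop = rng.normal(size=(D, D)) / np.sqrt(D)
W_coda = rng.normal(size=(D, D)) / np.sqrt(D)

def forward(tokens_embedded, T):
    """One pass: prelude once, the SAME recurrent block T times, coda once."""
    x = tokens_embedded @ W_prelude     # prelude: tokens -> latent state
    h = np.zeros_like(x)
    for _ in range(T):                  # same weights reused every loop
        h = np.tanh(h @ W_loop + x)     # input state reinjected each time
    return h @ W_coda                   # coda: final hidden state -> logits

logits = forward(rng.normal(size=(4, D)), T=8)
print(logits.shape)  # (4, 16)
```

Note that raising T increases compute without touching the parameter count: the only weights are the three matrices above.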
That structure matters because the compute increase comes from extra passes through the recurrent block, not from adding more unique layers. Together's edge-inference note makes the pitch plainly: when memory is the real limit, looping offers another axis for quality besides parameter count.
Results
Together says Parcae beat every parameter- and data-matched Transformer they trained at 140M, 370M, 770M, and 1.3B scales. The clearest number in the release is the 370M comparison, where Parcae scores 20.00 on Core versus 17.46 for the baseline, a 14.5 percent relative gain.
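The quoted figure is just the relative improvement of the two Core scores, which is easy to check:

```python
# Relative gain of Parcae's 370M Core score over the baseline Transformer.
parcae, baseline = 20.00, 17.46
gain = (parcae - baseline) / baseline
print(f"{gain:.1%}")  # 14.5%
```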
The same post says Parcae also improved on earlier looped-model recipes, with up to 6.3 percent lower validation perplexity. In the official blog post, the larger claim is that a 770M Parcae lands near the downstream quality of a 1.3B fixed-depth Transformer trained on the same data.
Scaling laws
The paper does more than announce a stable recipe. Together's scaling-law thread says Parcae establishes scaling laws for looping itself, with recurrence and data needing to scale together rather than independently.
The concrete example in the release thread is at 128×10^18 FLOPs on the 370M setup, where looped Parcae reaches 20.1 versus 18.1 for fixed depth. The paper frames test-time looping as a separate compute knob too, with gains that taper rather than growing linearly, which is a more usable story than "just loop more".
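The budget intuition behind the joint power law can be made concrete with a rough cost model. Assuming training FLOPs scale like 6 × parameters × loops × tokens (a common Transformer approximation, not a number from the release), extra loops at a fixed budget come straight out of the token count:

```python
# Hypothetical cost model: FLOPs ≈ 6 * params * loops * tokens.
# At a fixed budget, loops and affordable tokens trade off directly.
BUDGET = 128e18          # the 128x10^18 FLOP point cited in the release
PARAMS = 370e6           # the 370M setup

def tokens_affordable(loops):
    return BUDGET / (6 * PARAMS * loops)

for T in (1, 2, 4, 8):
    print(T, f"{tokens_affordable(T):.2e}")
```

Doubling recurrence halves the data you can train on, so under the joint power law the extra loops only help if the data side of the law is not starved.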
Code and models
The release is unusually complete for a research model. Together's final thread post says the training code and models are already public, and the GitHub repository plus the Hugging Face collection list 140M, 370M, 770M, and 1.3B variants.
That also clarifies provenance. The paper lists Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y. Fu as authors, while Together's thread says Fu led the work and Together AI supplied compute.