
Zyphra releases ZAYA1-8B-Diffusion-Preview on AMD with 4.6x-7.7x faster decoding

Zyphra released ZAYA1-8B-Diffusion-Preview, its first diffusion language model trained on AMD hardware, and said 16-token block generation delivers 4.6x-7.7x faster decoding with limited quality loss. The design targets the KV-cache bottleneck of autoregressive decoding while aiming to make post-training and test-time compute cheaper.


TL;DR

  • ZyphraAI's launch thread says ZAYA1-8B-Diffusion-Preview is the first diffusion language model the company trained on AMD, with claimed decoding gains of 4.6x to 7.7x over its autoregressive base.
  • According to ZyphraAI's decoding mechanics post, the preview generates 16-token blocks from a mask prior, then uses speculative decoding plus a mix of autoregressive and diffusion logits to accept tokens.
  • ZyphraAI's bottleneck explanation frames the pitch around memory bandwidth: autoregressive decoding keeps reloading KV cache state, while diffusion shifts more of inference into compute.
  • Rather than train a diffusion model from scratch, ZyphraAI's conversion post says it converted ZAYA1-8B-base mid-training and added diffusion SFT, borrowing from the TiDAR recipe.

You can read the official blog post, check the 16-token block decoding description, and see Zyphra explicitly tie the design to AMD VRAM capacity and its CCA attention variant. The most interesting claim is not just faster decoding, but that diffusion makes cheaper on-policy rollouts and more test-time compute practical, which ZyphraAI's later thread post pitches as the bigger payoff.

Decoding 16-token blocks

The preview does not replace every autoregressive step. ZyphraAI's decoding mechanics post says it proposes 16-token blocks with diffusion, then accepts tokens using standard speculative decoding and a sampler that mixes autoregressive and diffusion logits.
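To make those mechanics concrete, here is a minimal sketch of that loop: one diffusion pass drafts a full 16-token block from mask tokens, and a standard speculative accept/reject step checks the draft against a target distribution that mixes autoregressive and diffusion logits. Zyphra has not released this code; the toy stand-in models, the mixing weight, and every function name below are illustrative assumptions, not the ZAYA1 implementation.

```python
# Hedged sketch of block-diffusion drafting plus speculative verification.
import numpy as np

BLOCK, VOCAB, MASK_ID = 16, 32000, 0
rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def diffusion_logits(prefix, block):
    """Stand-in for the diffusion pass: denoise all masked positions at once."""
    return rng.normal(size=(len(block), VOCAB))

def ar_logits(prefix, draft):
    """Stand-in for the autoregressive pass scoring the whole draft in parallel."""
    return rng.normal(size=(len(draft), VOCAB))

def decode_block(prefix, mix=0.5):
    # 1. Propose: one diffusion pass fills a fully masked 16-token block.
    masked = [MASK_ID] * BLOCK
    p_draft = softmax(diffusion_logits(prefix, masked))
    draft = [int(rng.choice(VOCAB, p=p_draft[i])) for i in range(BLOCK)]

    # 2. Verify: speculative accept/reject against a target that mixes
    #    autoregressive and diffusion logits (the mix weight is assumed).
    target = softmax(mix * ar_logits(prefix, draft)
                     + (1 - mix) * diffusion_logits(prefix, draft))
    out = []
    for i, tok in enumerate(draft):
        if rng.random() < min(1.0, target[i, tok] / p_draft[i, tok]):
            out.append(tok)          # draft token accepted
        else:
            # Standard speculative-decoding rule: resample from the residual
            # distribution and stop the block at the first rejection.
            resid = np.maximum(target[i] - p_draft[i], 0.0)
            out.append(int(rng.choice(VOCAB, p=resid / resid.sum())))
            break
    return out                       # between 1 and 16 tokens per decode step

print(len(decode_block(prefix=[1, 2, 3])), "tokens accepted this step")
```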

That is where the speed claim lives. Zyphra says this setup trades some quality to push past TiDAR on speed, while ZyphraAI's quality post says the overall degradation versus the autoregressive base stays minimal and some benchmarks even improve.

KV cache bottlenecks

Zyphra's argument is that autoregressive decoding is memory-bandwidth bound because each new token reloads KV cache state, and each user's cache is loaded separately. In its telling, diffusion changes the shape of inference by generating blocks in parallel instead of marching one token at a time.
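A back-of-envelope calculation shows the shape of that argument. The figures below (layer count, GQA shape, context length, HBM bandwidth) are assumed round numbers for a generic ~8B model on a modern accelerator, not Zyphra's published specs:

```python
# Illustration of the bandwidth argument; all numbers are assumptions.
layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2   # bf16 cache
context = 4096                                          # tokens in cache

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per  # K and V
cache_bytes = context * kv_bytes_per_token
print(f"KV cache at {context} tokens: {cache_bytes / 1e9:.2f} GB")

hbm = 3.0e12  # bytes/s of HBM bandwidth, assumed
# Autoregressive: the whole cache is re-read for every single new token.
ar_ceiling = hbm / cache_bytes
# Block diffusion: one cache read is amortized over a 16-token block.
block_ceiling = 16 * hbm / cache_bytes
print(f"AR ceiling:    ~{ar_ceiling:,.0f} tok/s per sequence")
print(f"Block ceiling: ~{block_ceiling:,.0f} tok/s per sequence")
```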

ZyphraAI's architecture post pushes the same point one step further: once decoding becomes compute-bound, each dollar buys more RL and test-time compute.

Conversion instead of scratch training

The company says training a diffusion model from scratch is still awkward because diffusion's upside shows up at inference, not during training. So the preview starts from ZAYA1-8B-base, applies diffusion conversion during mid-training, then runs diffusion SFT on the existing stack.
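As a rough illustration of what a diffusion-conversion objective looks like, here is a masked-diffusion SFT step in the style of masked diffusion language models: corrupt a random fraction of target tokens to a mask id, predict them with bidirectional context, and take cross-entropy only on the masked positions. Zyphra's actual recipe (noise schedule, loss weighting, TiDAR specifics) is not public, so the uniform mask-rate sampling and the toy model below are assumptions.

```python
# Hedged sketch of one masked-diffusion SFT step on a converted checkpoint.
import numpy as np

VOCAB, MASK_ID = 32000, 0
rng = np.random.default_rng(1)

def model_logits(tokens):
    """Stand-in for the converted network (non-causal over the sequence)."""
    return rng.normal(size=(len(tokens), VOCAB))

def diffusion_sft_step(target):
    t = rng.uniform(0.1, 1.0)                    # sampled mask rate (assumed)
    mask = rng.random(len(target)) < t
    noised = np.where(mask, MASK_ID, target)     # corrupt masked positions
    logits = model_logits(noised)
    z = logits - logits.max(-1, keepdims=True)   # stable log-softmax
    logp = z - np.log(np.exp(z).sum(-1, keepdims=True))
    nll = -logp[np.arange(len(target)), target]
    # Cross-entropy only where tokens were masked, as in masked diffusion LMs.
    return (nll * mask).sum() / max(mask.sum(), 1)

tokens = rng.integers(1, VOCAB, size=64)
print(f"diffusion SFT loss: {diffusion_sft_step(tokens):.3f}")
```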

That makes this look more like a conversion pipeline than a brand new base model family. Zyphra attributes the quality retention to three things in its benchmark post: the conversion recipe, improved mid-training data, and the extra expressivity of non-causal generation.

AMD and CCA

Zyphra says both training and diffusion conversion ran on AMD, and it credits AMD's larger VRAM capacity for handling the longer sequences and extra overhead that diffusion introduces. The company pairs that hardware story with CCA, an attention variant it says is both FLOP- and memory-efficient.

ZyphraAI's CCA explanation makes the sharper claim: CCA reduces attention FLOPs, and because diffusion decoding behaves more like compute-bound prefill, CCA lets the model decode more tokens in parallel before compute becomes the limiter.
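A roofline-style sketch captures that logic. If each decode pass streams the weights once regardless of block size, throughput stays memory-bound until the block's compute catches up with the weight-read time, and cheaper attention pushes that crossover out to larger blocks. Every figure below is an assumed round number for a generic ~8B model, not a Zyphra measurement, and the attention FLOP costs are placeholders rather than CCA's actual numbers:

```python
# Roofline-flavored illustration of "cheaper attention -> more parallel tokens".
params = 8e9
peak_flops = 1.0e15        # sustained FLOP/s, assumed
hbm = 3.0e12               # bytes/s, assumed
weight_bytes = 2 * params  # bf16 weights, streamed once per forward pass

def step_time(block, attn_flops_per_token):
    compute = block * (2 * params + attn_flops_per_token) / peak_flops
    memory = weight_bytes / hbm   # weight reads dominate; KV traffic ignored
    return max(compute, memory)   # roofline: the slower side sets the pace

for attn in (4e9, 1e9):          # "dense" vs. cheaper attention, assumed costs
    for block in (16, 64, 256):
        t = step_time(block, attn)
        print(f"attn={attn:.0e} block={block:4d} -> {block / t:,.0f} tok/s")
```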

What Zyphra says comes next

The preview is not the end state Zyphra is promising. ZyphraAI's roadmap post says a fully post-trained diffusion model is coming soon, and ties the broader bet to more expressive non-causal generation plus cheaper on-policy rollouts.

That matters because the current release is explicitly a preview, not the finished RL-tuned version of this stack. The launch pitch is speed today, but the roadmap item is post-training economics.
