
Zyphra releases ZAYA1-8B with <1B active params and Markovian RSA reasoning

Zyphra released ZAYA1-8B, an Apache-2.0 reasoning MoE with compressed-convolutional attention and bounded-context Markovian RSA test-time compute. The model targets math and coding workloads while keeping the active parameter count below 1B.


TL;DR

  • Zyphra has released ZAYA1-8B, an Apache 2.0 open-weight reasoning MoE with 8.4B total parameters but only 0.76B active, announced in ZyphraAI's launch post alongside public weights and a hosted demo on Zyphra Cloud.
  • In Zyphra's own charts, ZyphraAI's benchmark claim places ZAYA1-8B ahead of Qwen3.5-4B and Gemma-4-E4B on math and code, while ZyphraAI's APEX-shortlist result says extra-high test-time compute can push it past DeepSeek-V3.2 and GPT OSS 120B High on that benchmark.
  • The model's architectural pitch is unusually specific: ZyphraAI's architecture breakdown says Zyphra swapped in Compressed Convolutional Attention with 8x KV-cache compression, an MLP router with PID-controller bias balancing, and learned residual scaling.
  • Zyphra says the training recipe was reasoning-first from pretraining onward: ZyphraAI's pretraining note introduces answer-preserving trimming for long chains of thought, and ZyphraAI's RL cascade post describes a four-stage RL stack rather than a single post-training pass.
  • The weirdest part is Markovian RSA: according to ZyphraAI's Markovian RSA explanation, each round carries forward only the last τ tokens from candidate traces, and teortaxesTex's report screenshots show Zyphra framing that as a bounded-context alternative to full-chain recursive aggregation.

You can read the blog post, skim the technical report, pull the weights, and try it on Zyphra Cloud. The report screenshots in teortaxesTex's post expose exact config details that the launch thread only summarized, including 0.76B active params, 8.4B total params, 16 experts, Gemma3 tokenizer, and AMD MI300X plus Pollara networking. A separate reaction from teortaxesTex amplifying ypwang61 also ties Zyphra's training pipeline to the RLVE environments paper.

What shipped

ZAYA1-8B is an open-weight reasoning model, licensed under Apache 2.0, with four day-one surfaces in the launch materials: the official announcement, the technical report, Hugging Face weights, and Zyphra Cloud.

ZyphraAI's launch post describes it as a reasoning MoE trained on AMD and optimized for "intelligence density," with less than 1B active parameters. The config table visible in teortaxesTex's report screenshots sharpens that marketing line into 0.76B active parameters and 8.4B total parameters.

Zyphra is also positioning the model as a small-model reasoning play, not a general frontier announcement. In ZyphraAI's benchmark claim, the comparison set is Qwen3.5-4B, Gemma-4-E4B, DeepSeek-R1-0528, Gemini-2.5-Pro, and Claude 4.5 Sonnet, with math and code named as the target workloads.

MoE++

The architecture section is the cleanest part of the release. Zyphra says it changed three things relative to standard MoEs:

  1. Compressed Convolutional Attention, CCA: sequence mixing in a compressed latent space, with 8x KV-cache compression.
  2. ZAYA1 router: an MLP-based router with PID-controller bias balancing.
  3. Learned residual scaling: an explicit residual-scaling change rather than a stock MoE block.

The report screenshot in teortaxesTex's post adds a few exact numbers that matter more than the buzzwords:

  • 40 transformer layers
  • Hidden dimension 2048
  • 16 experts per MoE layer
  • Top-1 routing, no residual expert
  • 2x query compression
  • 8x KV-cache compression relative to full multi-head attention
  • Gemma3 tokenizer with 262,272 vocabulary size
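
The report does not ship reference code, but those compression ratios are easy to make concrete. Below is a minimal single-head sketch of the general compressed-KV idea, not Zyphra's CCA: the module names and the down/up-projection scheme are assumptions, and the convolutional sequence mixing that gives CCA its name is omitted entirely.

```python
import torch
import torch.nn as nn

class CompressedKVAttention(nn.Module):
    """Illustrative only: single-head attention that caches 8x-compressed
    K/V latents and uses 2x-compressed queries, mirroring the ratios in
    the ZAYA1 config table. Not Zyphra's actual CCA (no conv mixing,
    no causal mask, no multi-head split)."""

    def __init__(self, d_model=2048, kv_ratio=8, q_ratio=2):
        super().__init__()
        d_kv = d_model // kv_ratio   # 256 per token for K and for V: 8x smaller cache
        self.d_q = d_model // q_ratio  # 1024: compressed query width
        self.q_proj = nn.Linear(d_model, self.d_q)
        self.k_proj = nn.Linear(d_model, d_kv)   # compressed keys (cached)
        self.v_proj = nn.Linear(d_model, d_kv)   # compressed values (cached)
        self.k_up = nn.Linear(d_kv, self.d_q)    # expand latents for the score dot product
        self.v_up = nn.Linear(d_kv, self.d_q)
        self.out = nn.Linear(self.d_q, d_model)

    def forward(self, x, kv_cache=None):
        q = self.q_proj(x)                               # (batch, seq, d_q)
        k_lat, v_lat = self.k_proj(x), self.v_proj(x)    # only these get cached
        if kv_cache is not None:                         # decode step: extend the cache
            k_lat = torch.cat([kv_cache[0], k_lat], dim=1)
            v_lat = torch.cat([kv_cache[1], v_lat], dim=1)
        k, v = self.k_up(k_lat), self.v_up(v_lat)
        scores = q @ k.transpose(-2, -1) / self.d_q ** 0.5
        attn = torch.softmax(scores, dim=-1)             # causal mask omitted for brevity
        return self.out(attn @ v), (k_lat, v_lat)
```

The cache line is the point of the sketch: only two 256-wide latents are stored per token, versus 2x2048 for full multi-head attention at this hidden size, which is where an 8x reduction would come from.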

Commentary around the release focused on the router. In teortaxesTex's reply thread, teortaxesTex argues Zyphra reduced expert size and count versus the post-DSMoE trend, then pushed more complexity into a more expressive router instead.
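
The posts do not spell out the router math, but the PID framing maps onto a familiar load-balancing pattern: nudge per-expert bias terms so observed routing frequencies track a uniform target. A hedged sketch under that reading follows; the MLP shape, gains, and update rule are all assumptions, not Zyphra's published rule.

```python
import torch
import torch.nn as nn

class PIDBiasRouter(nn.Module):
    """Illustrative MoE router: an MLP scores experts, and a PID controller
    steers non-learned bias terms toward balanced expert load. A generic
    reading of "PID-controller bias balancing", not Zyphra's exact rule."""

    def __init__(self, d_model=2048, n_experts=16, kp=1e-2, ki=1e-3, kd=1e-2):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(d_model, d_model // 4),
            nn.SiLU(),
            nn.Linear(d_model // 4, n_experts),
        )
        self.register_buffer("bias", torch.zeros(n_experts))
        self.register_buffer("integral", torch.zeros(n_experts))
        self.register_buffer("prev_err", torch.zeros(n_experts))
        self.kp, self.ki, self.kd = kp, ki, kd
        self.n_experts = n_experts

    def forward(self, x):
        logits = self.score(x)                        # (tokens, n_experts)
        expert = (logits + self.bias).argmax(dim=-1)  # top-1; bias only affects the choice

        if self.training:
            # PID step: error = uniform target share minus observed share
            load = torch.bincount(expert, minlength=self.n_experts).float() / x.shape[0]
            err = 1.0 / self.n_experts - load         # positive => expert is underloaded
            self.integral += err
            self.bias += self.kp * err + self.ki * self.integral + self.kd * (err - self.prev_err)
            self.prev_err.copy_(err)
        return expert, logits
```

Treating the bias as a control signal rather than a learned parameter keeps load balancing out of the gradient path, which is the usual motivation for bias-based balancing schemes.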

Reasoning-first pretraining

Zyphra says reasoning was present from the start of pretraining, not added only in a later RL stage. The practical trick here is answer-preserving trimming: when reasoning traces are longer than the pretraining context budget, Zyphra drops the tail of the chain while keeping the final answer.

That choice matters because it is a direct concession to long-chain data inside short-context pretraining. The launch thread does not treat chain of thought as an inference-only scaffold.
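
As a rough illustration of how such a trim could look in a data pipeline (the function and its inputs are hypothetical; the launch materials only name the technique):

```python
def trim_trace(prompt_ids, cot_ids, answer_ids, budget):
    """Illustrative answer-preserving trimming: if prompt + chain of
    thought + answer overflows the context budget, cut tokens from the
    tail of the chain of thought and keep the final answer intact."""
    overflow = len(prompt_ids) + len(cot_ids) + len(answer_ids) - budget
    if overflow > 0:
        cot_ids = cot_ids[: max(0, len(cot_ids) - overflow)]  # drop the CoT tail only
    return prompt_ids + cot_ids + answer_ids
```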

Post-training then runs as a four-stage cascade on what Zyphra calls a shared algorithmic spine. The stages listed in the release are:

  1. Reasoning warmup
  2. RLVE-Gym curriculum
  3. Math, code, and TTC RL
  4. Behavioral RL

The implementation notes in ZyphraAI's RL cascade post name async PipelineRL, DPPO Binary-TV trust regions, Dr-GRPO loss aggregation, MaxRL advantages, and no KL-in-reward. teortaxesTex amplifying ypwang61 adds one extra breadcrumb: the RLVE environments used in the pipeline come from the RLVE environments paper.
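
Of those names, Dr-GRPO has a public reference point: it removes GRPO's per-response length normalization and per-group standard-deviation division. Here is a sketch of that aggregation under the published Dr-GRPO definition; how Zyphra combines it with MaxRL advantages and Binary-TV trust regions is not shown in the launch materials, and clipping is omitted for brevity.

```python
import torch

def dr_grpo_loss(logprobs, old_logprobs, rewards, mask):
    """Sketch of Dr-GRPO-style aggregation for one group of G rollouts.
    logprobs, old_logprobs, mask: (G, T) tensors; rewards: (G,).
    Versus vanilla GRPO: advantages are not divided by the group reward
    std, and token losses are not divided by each response's length."""
    adv = rewards - rewards.mean()              # group-relative advantage, no /std
    ratio = torch.exp(logprobs - old_logprobs)  # PPO-style ratio (clipping omitted)
    per_token = -adv[:, None] * ratio * mask    # mask zeroes padding tokens
    return per_token.sum() / mask.numel()       # constant G*T normalizer, not per-length
```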

Markovian RSA

Markovian RSA is the release's most distinctive inference idea. Zyphra describes it as recursive candidate aggregation with bounded carryover: each round keeps only the final τ tokens from candidate traces, then uses those tails to seed the next round.

That bounded state is the whole pitch. Full-chain aggregation gets more expensive as traces get longer. Zyphra claims Markovian RSA decouples context growth from reasoning duration by forwarding only tails instead of entire chains.

The report diagram surfaced by teortaxesTex's screenshots shows the mechanism more concretely. A population of N candidate traces produces N tails, each new aggregation prompt samples C tails, and the next population is generated from those compressed prompts. The table in the same screenshot also sketches edge cases: T = 0 collapses to independent responses, C = 1 becomes a Delethink-style bounded continuation regime, and τ = β recovers full-chain RSA.
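
A minimal sketch of that loop, with `generate` and the prompt assembly as placeholders, since the report's exact prompts and sampling rules are not public in the thread:

```python
import random

def markovian_rsa(generate, question, rounds=4, N=8, C=4, tau=512):
    """Sketch of bounded-context Markovian RSA as described above: each
    round forwards only the last tau tokens of each candidate trace, so
    prompt size stays fixed no matter how long total reasoning runs."""
    tails = []  # round 0 has no tails, so candidates start out independent
    population = []
    for _ in range(rounds):
        population = []
        for _ in range(N):
            seeds = random.sample(tails, C) if len(tails) >= C else list(tails)
            population.append(generate(question, seeds))  # prompt = question + C sampled tails
        tails = [trace[-tau:] for trace in population]    # keep only the bounded tails
    return population
```

Each aggregation prompt tops out at the question plus roughly C·τ tokens, which is the decoupling of context growth from reasoning duration described above.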

Zyphra's strongest performance claim hangs on this setup. With extra-high Markovian RSA compute, the company says ZAYA1-8B spends up to 5.5M tokens per question on average and beats DeepSeek-V3.2 and GPT OSS 120B High on APEX-shortlist.

AMD stack

Zyphra is making the hardware story part of the model story. The company says ZAYA1-8B was developed on custom MI300X clusters with AMD Pensando Pollara networking, building on its earlier AMD pretraining work with IBM Cloud.

The claim in ZyphraAI's AMD infrastructure post is not just that the model ran on AMD. Zyphra specifically says MI300X's higher HBM enabled longer-context training with less parallelism, which is a more concrete hardware rationale than the usual accelerator name-drop.
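
Some back-of-the-envelope arithmetic makes the HBM rationale concrete. MI300X's 192 GB of HBM3 is a public spec; everything else below is an illustrative assumption, not a number from the report.

```python
# Illustrative memory arithmetic only; not Zyphra's actual training layout.
total_params = 8.4e9
weights_gb = total_params * 2 / 1e9   # bf16 weights: ~16.8 GB

mi300x_hbm = 192                      # GB, public MI300X spec
typical_hbm = 80                      # GB, a common accelerator class (assumption)

# Headroom left for activations, optimizer state, and long-sequence caches:
print(mi300x_hbm - weights_gb)        # ~175 GB
print(typical_hbm - weights_gb)       # ~63 GB

# Roughly 2.8x more headroom per GPU means a long-context batch fits with a
# smaller tensor/sequence-parallel degree -- the "less parallelism" claim.
```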

The config table visible in teortaxesTex's screenshots matches that framing by listing AMD MI300X with Pollara networking as the primary training hardware. Separate commentary in teortaxesTex on scale and AMD treats that partnership as the route to larger-scale models, and teortaxesTex's original reaction says an 80B model is upcoming, a detail that does not appear in the core Zyphra launch thread.
