Zyphra releases ZAYA1-VL-8B with 700M active params and Apache 2.0

Zyphra released its first vision-language model, an 8B MoE with 700M active parameters and visual LoRA adapters. The model matters because it targets OCR, document reasoning, GUI interaction, and computer-use workloads under an Apache 2.0 license.

TL;DR

You can read the announcement, skim the model card, and trace the base text stack back to ZAYA1-8B. The weirdly useful detail is that Zyphra did not build the visual side from scratch: the model card says it keeps Zyphra's language decoder but plugs in the Qwen2.5-VL vision encoder, then layers its own attention and LoRA changes on top.

OCR, grounding, and GUI

The launch pitch is unusually concrete for a small VLM release. ZyphraAI's capability list names six target workloads:

  • visual understanding
  • OCR
  • document reasoning
  • grounding
  • bounding boxes
  • GUI interaction

The blog post sharpens that further into document understanding, spatial perception, and computer-use tasks. That gives the release a cleaner shape than a generic "multimodal" model card.

Vision-specific LoRA and bidirectional attention

Zyphra's core claim is that standard VLMs inherit two mismatches from text-first models, and both got explicit fixes in ZAYA1-VL-8B.

  • Bidirectional attention for image tokens: the model does not force left-to-right causal attention across an image, because image tokens are not naturally ordered that way, according to ZyphraAI's architecture note.
  • Vision-specific LoRA adapters: the model adds LoRA parameters on the MLPs and CCA (compressed convolutional attention) weights that activate only on vision tokens, giving the model dedicated visual capacity without retraining new experts from scratch, per ZyphraAI's routing fix note.

The model card adds one more useful implementation detail: ZAYA1-VL-8B keeps ZAYA1-8B as the text decoder and uses the Qwen2.5-VL vision encoder for the ViT.
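To make those two fixes concrete, here is a minimal PyTorch-style sketch of a vision-gated LoRA layer and a mixed causal/bidirectional attention mask. The module names, shapes, and gating are illustrative assumptions, not Zyphra's implementation; in the real model the vision gate would come from the processor's token-type information and the bidirectional region would be scoped per image.

    # Illustrative sketch only: token-type-gated LoRA and a mixed causal/
    # bidirectional attention mask. Names and shapes are assumptions, not
    # Zyphra's actual code.
    import torch
    import torch.nn as nn

    class VisionGatedLoRALinear(nn.Module):
        """A frozen linear layer plus a low-rank delta that is applied only
        where is_vision is True, so the extra capacity fires on image tokens."""
        def __init__(self, base: nn.Linear, rank: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False
            self.lora_a = nn.Linear(base.in_features, rank, bias=False)
            self.lora_b = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)  # start as a no-op

        def forward(self, x: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq, in_features); is_vision: (batch, seq) bool
            delta = self.lora_b(self.lora_a(x))
            return self.base(x) + delta * is_vision.unsqueeze(-1)

    def mixed_attention_mask(is_vision: torch.Tensor) -> torch.Tensor:
        """Boolean (batch, seq, seq) mask: text positions attend causally,
        while pairs of image positions may also attend bidirectionally."""
        bsz, seq = is_vision.shape
        causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool, device=is_vision.device))
        causal = causal.unsqueeze(0).expand(bsz, -1, -1)
        img_pair = is_vision.unsqueeze(2) & is_vision.unsqueeze(1)  # query and key both image
        return causal | img_pair

The point of the gate is that the LoRA delta contributes nothing at text positions, which is why Zyphra can describe it as adding visual capacity without touching the text behavior of the frozen decoder.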

140B tokens, not trillions

Zyphra says the model was trained on roughly 140B multimodal tokens and leaned on higher-quality image and document data instead of video-heavy corpora. In the announcement, that smaller recipe is part of the pitch: better FLOP efficiency, not just a smaller checkpoint.

The company also surfaced two MoE-specific training problems in ZyphraAI's training challenges note: effective batch size collapses when supervision lands only on answer tokens, and routing destabilizes when the model shifts from language-only to language-plus-vision. The vision-only LoRA bank is presented as the architectural buffer for that second problem.
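To see why answer-only supervision shrinks the effective batch size, here is a generic SFT label-masking sketch, not Zyphra's training code: prompt and image tokens get the ignore index, so only the usually short answer spans carry gradient signal even though the model processes the full multimodal sequence.

    # Illustrative: label masking in multimodal SFT. Only answer tokens are
    # supervised; prompt and image tokens are set to -100 (ignored by the
    # loss), so the count of supervised tokens per step can be a small
    # fraction of the tokens actually processed.
    import torch
    import torch.nn.functional as F

    IGNORE_INDEX = -100

    def build_labels(input_ids: torch.Tensor, answer_mask: torch.Tensor) -> torch.Tensor:
        # input_ids, answer_mask: (batch, seq); answer_mask is True on answer tokens
        labels = input_ids.clone()
        labels[~answer_mask] = IGNORE_INDEX
        return labels

    def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Standard next-token shift; IGNORE_INDEX positions contribute nothing.
        shift_logits = logits[:, :-1].reshape(-1, logits.size(-1))
        shift_labels = labels[:, 1:].reshape(-1)
        return F.cross_entropy(shift_logits, shift_labels, ignore_index=IGNORE_INDEX)

    # The "effective batch size" Zyphra is worried about is essentially
    # (labels != IGNORE_INDEX).sum() per step, not the raw token count.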

Apache release and quick start

ZAYA1-VL-8B is out under Apache 2.0 with weights on Hugging Face. The model card says local use currently depends on Zyphra's zaya1-vl branch of transformers, based on v4.57.1, plus qwen-vl-utils==0.0.2 and flash_attn.

That requirement is the most practical caveat in the release. The model is open, but the easiest path today still runs through Zyphra's own fork rather than mainline Transformers.
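For orientation, a hedged quick-start sketch under those dependencies. The fork URL, Hugging Face repo name, loader class, and chat format below are assumptions patterned on the standard Qwen2.5-VL workflow rather than details from the release; the model card's own snippet is the authoritative version.

    # Hedged quick-start sketch; repo URL, model id, and class choice are assumed.
    #
    #   pip install "git+https://github.com/Zyphra/transformers.git@zaya1-vl"  # fork of v4.57.1 (URL assumed)
    #   pip install qwen-vl-utils==0.0.2 flash-attn
    #
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "Zyphra/ZAYA1-VL-8B"  # assumed Hugging Face repo name
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
        trust_remote_code=True,
    )

    # A document-reasoning style prompt, using the Qwen-style message format.
    messages = [{"role": "user", "content": [
        {"type": "image", "image": "invoice.png"},
        {"type": "text", "text": "Extract the invoice total."},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)

    out = model.generate(**inputs, max_new_tokens=256)
    answer = processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)[0]
    print(answer)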
