Zyphra releases ZAYA1-VL-8B with 700M active params and Apache 2.0
Zyphra released its first vision-language model, an 8B MoE with 700M active parameters and visual LoRA adapters. The model matters because it targets OCR, document reasoning, GUI interaction, and computer-use workloads under an Apache 2.0 license.

TL;DR
- Zyphra shipped ZAYA1-VL-8B as its first vision-language model: a 700M-active, 8B-total MoE built on ZAYA1-8B, announced in ZyphraAI's launch post and released under Apache 2.0.
- According to ZyphraAI's capability list, ZAYA1-VL-8B is aimed at visual understanding, OCR, document reasoning, grounding, bounding boxes, and GUI interaction, which puts computer-use agents squarely in scope.
- The two big architectural changes, per ZyphraAI's architecture note and the Hugging Face model card, are bidirectional attention for image tokens and vision-only LoRA adapters on MLP and CCA weights.
- According to ZyphraAI's training note and the announcement, the model was trained on roughly 140B multimodal tokens, far below the trillions used by many competing VLMs.
You can read the announcement, skim the model card, and trace the base text stack back to ZAYA1-8B. The weirdly useful detail is that Zyphra did not build the visual side from scratch: the model card says it keeps Zyphra's language decoder but plugs in the Qwen2.5-VL vision encoder, then layers its own attention and LoRA changes on top.
OCR, grounding, and GUI
The launch pitch is unusually concrete for a small VLM release. ZyphraAI's capability list names six target workloads:
- visual understanding
- OCR
- document reasoning
- grounding
- bounding boxes
- GUI interaction
The blog post sharpens that further into document understanding, spatial perception, and computer-use tasks. That gives the release a cleaner shape than a generic "multimodal" model card.
Vision-specific LoRA and bidirectional attention
Zyphra's core claim is that standard VLMs inherit two mismatches from text-first models, and both got explicit fixes in ZAYA1-VL-8B.
- Bidirectional attention for image tokens: the model does not force left-to-right causal attention across an image, because image tokens are not naturally ordered that way, according to ZyphraAI's architecture note.
- Vision-specific LoRA adapters: the model adds LoRA parameters on MLPs and CCA weights that activate only on vision tokens, giving the model dedicated visual capacity without retraining new experts from scratch, per ZyphraAI's routing fix note.
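The attention change is easy to picture as a mask construction: start from a standard causal mask, then open up the image region so image tokens can see each other in both directions. The sketch below is illustrative, not Zyphra's implementation; the function name and the `"text"`/`"image"` token-type encoding are assumptions.

```python
import numpy as np

def build_attention_mask(token_types):
    """Hypothetical sketch: token_types is a sequence of "text" / "image".
    Text tokens keep causal (left-to-right) attention; image tokens are
    additionally allowed to attend to every other image token."""
    n = len(token_types)
    mask = np.tril(np.ones((n, n), dtype=bool))   # standard causal mask
    img = np.array([t == "image" for t in token_types])
    mask |= np.outer(img, img)                    # image<->image: bidirectional
    return mask

# A short mixed sequence: text token, two image tokens, text token.
mask = build_attention_mask(["text", "image", "image", "text"])
```

Here the first image token can attend to the later image token, while text positions stay strictly causal; a real implementation would restrict the bidirectional region to each image's own token span rather than all image tokens globally.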
The model card adds one more useful implementation detail: ZAYA1-VL-8B keeps ZAYA1-8B as the text decoder and uses the Qwen2.5-VL vision encoder for the ViT.
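A minimal way to picture the vision-only LoRA path: the frozen base weight runs on every token, and a low-rank delta is gated in only where the token is visual. All names and shapes below are illustrative assumptions, not the actual ZAYA1-VL-8B code.

```python
import numpy as np

def mlp_with_vision_lora(x, is_vision, W, A, B):
    """Illustrative sketch of a vision-gated LoRA on one weight.
    x: (n, d) token activations; W: (d, d) frozen base weight;
    A: (r, d) and B: (d, r) low-rank LoRA factors, rank r << d;
    is_vision: (n,) 1.0 for vision tokens, 0.0 for text tokens."""
    base = x @ W.T                # base path, applied to all tokens
    lora = x @ (B @ A).T          # low-rank update
    return base + is_vision[:, None] * lora   # delta only on vision tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 4))
A, B = rng.normal(size=(2, 4)), rng.normal(size=(4, 2))
is_vision = np.array([0.0, 1.0, 0.0])   # only the middle token is visual
out = mlp_with_vision_lora(x, is_vision, W, A, B)
```

Text tokens pass through unchanged, which is the point: the extra visual capacity costs nothing on language-only inputs and avoids retraining new experts.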
140B tokens, not trillions
Zyphra says the model used roughly 140B multimodal tokens and leaned on higher-quality image and document data instead of video-heavy corpora. In the announcement, that smaller recipe is part of the pitch: better FLOP efficiency, not just a smaller checkpoint.
The company also surfaced two MoE-specific training problems in ZyphraAI's training challenges note: effective batch size collapses when supervision lands only on answer tokens, and routing destabilizes when the model shifts from language-only to language-plus-vision. The vision-only LoRA bank is presented as the architectural buffer for that second problem.
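The first problem is easy to make concrete: when the loss is masked to answer tokens only, the effective batch is the count of supervised tokens, which can be a small fraction of the tokens processed. A hedged illustration, assuming the common Hugging Face-style -100 ignore-index convention (the helper name is hypothetical, not Zyphra's code):

```python
import numpy as np

IGNORE_INDEX = -100   # common convention for positions excluded from the loss

def supervised_fraction(labels):
    """Fraction of tokens that actually contribute gradient signal when
    supervision lands only on answer tokens."""
    labels = np.asarray(labels)
    return (labels != IGNORE_INDEX).sum() / labels.size

# Two sequences: long prompts (masked out) with short answers (supervised).
labels = [
    [-100, -100, -100, -100, -100, -100, 42, 7],
    [-100, -100, -100, -100, -100, -100, -100, 9],
]
frac = supervised_fraction(labels)   # only 3 of 16 tokens carry loss
```

In this toy case the optimizer sees sixteen tokens but gets gradient from three, so the effective batch size collapses relative to the compute spent.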
Apache release and quick start
ZAYA1-VL-8B is out under Apache 2.0 with weights on Hugging Face. The model card says local use currently depends on Zyphra's zaya1-vl branch of transformers, based on v4.57.1, plus qwen-vl-utils==0.0.2 and flash_attn.
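Based on the dependencies named in the model card, an environment setup might look like the following. Only the branch name, the transformers base version, and the pinned package versions come from the card; the GitHub org/repo path for the fork is an assumption, so check the model card for the authoritative source.

```shell
# Zyphra's transformers fork, zaya1-vl branch (based on v4.57.1).
# NOTE: the org/repo URL below is an assumption, not confirmed by the card.
pip install "git+https://github.com/Zyphra/transformers.git@zaya1-vl"

# Pinned vision preprocessing helper, plus flash attention kernels.
pip install "qwen-vl-utils==0.0.2" flash-attn
```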
That requirement is the most practical caveat in the release. The model is open, but the easiest path today still runs through Zyphra's own fork rather than mainline Transformers.