Z.ai launches GLM-5V-Turbo for screenshot coding and GUI-agent tasks
Z.ai released GLM-5V-Turbo, a multimodal coding model for screenshots, video, design drafts, and GUI-agent tasks. It keeps text-coding performance steady while adding native vision support, so teams can test visual workflows without swapping models.

TL;DR
- Z.ai's launch thread positioned GLM-5V-Turbo as a native multimodal coding model for images, video, design drafts, and document layouts, while the official docs describe it as the company's first multimodal coding foundation model.
- According to Z.ai's benchmark post, the model keeps its CC-Bench-V2 backend, frontend, and repo-exploration scores in the same range as GLM-5-Turbo, which is the core claim behind the "vision without text-coding regression" pitch.
- Z.ai's architecture thread attributes the launch to four changes: native text-vision fusion, reinforcement learning across 30-plus task types, agentic data generation, and a multimodal toolchain that extends beyond text-only tools.
- bridgemindai's OpenRouter screenshot and OpenRouter's model card both list a roughly 202K token context window, with OpenRouter showing $1.20 per million input tokens and $4 per million output tokens.
- Vercel's AI Gateway announcement added the model under
zai/glm-5v-turbo, while Vision Arena's launch post put it into public head-to-head testing the same day.
You can read the official model docs, check Vercel's AI Gateway example for screenshot-to-React prompting, and compare the public packaging on OpenRouter. The useful oddity is that Z.ai is selling this as both a design-to-code model and a Claude Code or OpenClaw component, so the launch page, the benchmark tables, and the distribution partners all describe the same perceive, plan, execute loop from slightly different angles.
Screenshot coding and GUI loops
the official docs split the model's main use cases into three concrete buckets: front-end recreation from designs, autonomous GUI exploration and recreation, and screenshot-based debugging. The docs also say it can plug into Claude Code and OpenClaw to move from static mockup reproduction to browsing a live site, collecting visual details, and generating code from what it finds.
That makes this a narrower and more useful launch than a generic "multimodal model" release. Vercel's announcement describes the same workflow in plainer developer language: turn screenshots and designs into code, debug visually, and operate GUIs autonomously.
Four-system upgrade
Z.ai's thread and the long-form docs line up on the training story. Z.ai says the model's gains come from four linked upgrades:
- Native multimodal fusion: text and vision are fused from pretraining onward, using a new CogViT encoder and an inference-friendly MTP structure.
- 30-plus task collaborative RL: the reinforcement learning stage jointly optimizes STEM, grounding, video, GUI-agent, and coding-agent tasks instead of tuning one narrow domain.
- Agentic data construction: Z.ai says it built synthetic, verifiable environments to generate action data and injected agentic meta-capabilities during pretraining.
- Multimodal tools: the docs say the toolchain now includes visual actions such as screenshot reading, webpage reading with image recognition, and box drawing, not just text tools.
The interesting bit is the tool claim. Most model launches stop at perception. Z.ai is explicitly arguing that the important boundary is the full loop, which the docs phrase as understanding the environment, planning actions, and executing the task.
Benchmarks: strong on design and Android, mixed against Opus on some GUI tests
The benchmark table in the launch recap is better read row by row than as one grand victory lap.
- Design2Code: GLM-5V-Turbo scores 94.8, ahead of Kimi K2.5 at 91.3 and Claude Opus 4.6 at 77.3.
- Flame-VLM-Code: Claude still leads, 98.8 versus GLM-5V-Turbo's 93.8.
- BrowseComp-VL, MMSearch, MMSearch-Plus, SimpleVQA, V*: GLM-5V-Turbo leads the visible comparison rows in the shared table.
- AndroidWorld: GLM-5V-Turbo posts 75.7, ahead of Kimi's 43.1 and Opus 4.6's 62.0.
- OSWorld: Claude Opus 4.6 remains ahead at 72.2, versus GLM-5V-Turbo's 62.3.
- WebVoyager: GLM-5V-Turbo edges Opus 4.6, 88.5 to 88.0.
The text-only coding claim is more conservative. Z.ai's separate post says the visual upgrade did not degrade CC-Bench-V2 backend, frontend, and repo-exploration performance, and the posted table mostly supports that framing: GLM-5V-Turbo is close to GLM-5-Turbo on all three rows, but still trails Claude Opus 4.6 on each one.
API packaging and price
OpenRouter's listing shows the public packaging clearly: release date April 1, 2026, 202,752 tokens of context, about 131K max output, $1.20 per million input tokens, $4 per million output tokens, and discounted cache reads. bridgemindai's screenshot surfaced the same numbers within hours of launch.
The official docs round those limits to 200K context and 128K max output, which is the usual difference between product-page decimals and docs-page integers. Vercel exposed the model the same day under zai/glm-5v-turbo, with a sample prompt for recreating a screenshot as a responsive React component with Tailwind CSS in the AI Gateway changelog.
Public evals started immediately
Vision Arena added GLM-5V-Turbo on launch day, and Arena's follow-up pushed users to upload their own image prompts at arena.ai. That matters because the internal benchmark spread is uneven enough that public pairwise testing will probably tell the clearer story.
Another small distribution signal showed up in testingcatalog's screenshot, where GLM-5V-Turbo appeared as a new selectable model inside Z.ai chat alongside deep research, web development, document reading, and visual recognition shortcuts. The rollout was not just an API drop, it landed in the consumer chat surface, on Vercel's gateway, on OpenRouter, and in a public arena within a few hours.