Z.ai released GLM-5V-Turbo, a multimodal coding model for screenshots, video, design drafts, and GUI-agent tasks. It keeps text-coding performance steady while adding native vision support, so teams can test visual workflows without swapping models.

You can read the official docs, copy Vercel's zai/glm-5v-turbo model ID from its AI Gateway changelog, and inspect the full benchmark sheets in Z.ai's coding table and the multimodal comparison image. The weirdly useful detail is that Z.ai did not just market this as vision plus code: it spelled out, in the launch thread, the exact training layers it thinks made the model work.
Z.ai's launch pitch is unusually specific. GLM-5V-Turbo is supposed to accept images, videos, design drafts, and document layouts natively, then use that visual context for coding and agent execution.
The official documentation fills in the product shape: text output only, 200K context, up to 128K output tokens, optional deep-thinking mode, function calling, streaming, and context caching (official docs). It also frames Claude Code and OpenClaw as first-class agent integrations, not afterthoughts.
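To make that product shape concrete, here is a minimal request sketch against an OpenAI-compatible chat completions endpoint, pairing an image with a text instruction and switching on the deep-thinking mode. The base URL, the `glm-5v-turbo` model ID string, and the shape of the `thinking` field are assumptions to verify against the official docs, not confirmed values.

```ts
// Minimal vision-request sketch, assuming an OpenAI-compatible chat
// completions endpoint. Base URL, model ID, the `thinking` field, and the
// env var name are assumptions; check the official docs before relying on them.
const resp = await fetch("https://api.z.ai/api/paas/v4/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.ZAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "glm-5v-turbo",           // assumed API model ID
    thinking: { type: "enabled" },   // assumed deep-thinking toggle
    stream: false,
    messages: [
      {
        role: "user",
        content: [
          { type: "image_url", image_url: { url: "https://example.com/design-draft.png" } },
          { type: "text", text: "Describe this layout and list the UI components you see." },
        ],
      },
    ],
  }),
});
const data = await resp.json();
console.log(data.choices[0].message.content); // output is text only
```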
Vercel's integration note makes the intended workflow even clearer. Its example prompt is not generic vision QA; it is "recreate this screenshot as a responsive React component with Tailwind CSS," which is exactly the design-to-code path Z.ai is selling (Vercel changelog).
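A sketch of that path is below: it hands the model a local screenshot plus the changelog's prompt through the AI SDK, assuming AI SDK 5 with the AI Gateway configured (AI_GATEWAY_API_KEY) so the plain zai/glm-5v-turbo string resolves through the gateway. The screenshot filename is a placeholder.

```ts
import { readFileSync } from "node:fs";
import { generateText } from "ai";

// Design-to-code sketch, assuming AI SDK 5 routes plain model ID strings
// through Vercel's AI Gateway. The screenshot path is a placeholder.
const { text } = await generateText({
  model: "zai/glm-5v-turbo", // model ID from Vercel's changelog
  messages: [
    {
      role: "user",
      content: [
        { type: "image", image: readFileSync("dashboard-screenshot.png") },
        {
          type: "text",
          text: "Recreate this screenshot as a responsive React component with Tailwind CSS.",
        },
      ],
    },
  ],
});

console.log(text); // the component source arrives as ordinary text output
```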
The multimodal sheet shows a model that is strong across many tasks, but not uniformly best.
That table is more nuanced than the launch copy alone suggests. The strong spots are visual search, Android GUI work, and design reconstruction. The weak spots are the familiar frontier-model holdouts, especially OSWorld.
The text-coding table matters because multimodal launches often leave open the question of whether a new visual encoder cost the model something elsewhere.
On Z.ai's own CC-Bench-V2 slice, GLM-5V-Turbo scores 22.8 on backend, 68.4 on frontend, and 72.2 on repo exploration. That puts it close to GLM-5-Turbo on backend and frontend, and ahead on repo exploration, while still trailing Claude Opus 4.6 across all three rows (coding table).
The cleaner read is not "best coding model." It is that Z.ai added native vision without obviously wrecking the base coding profile it already had.
Z.ai attached a rare amount of model-making detail to the launch, crediting four system-level upgrades. The first it names is native multimodal fusion: deep text-vision fusion that starts at pre-training, carries through multimodal collaborative optimization in post-training, and sits on a next-generation CogViT encoder.
That stack lines up with the benchmark profile. A model trained on GUI action prediction and multimodal tool use should look better on AndroidWorld and WebVoyager than a plain vision-language model, and that is exactly where GLM-5V-Turbo looks most competitive (training breakdown).
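To show what that agent-facing surface can look like, here is a hedged sketch of a single GUI-action tool definition in OpenAI-style function-calling format (function calling is listed in the official docs). The tap_element name, its schema, and its semantics are illustrative assumptions, not part of Z.ai's published tooling.

```ts
// Hypothetical GUI-action tool in OpenAI-style function-calling format.
// Function calling is supported per the official docs; the tool itself
// (name, schema, tap semantics) is an illustrative assumption.
const tools = [
  {
    type: "function",
    function: {
      name: "tap_element",
      description: "Tap a UI element at normalized screen coordinates.",
      parameters: {
        type: "object",
        properties: {
          x: { type: "number", description: "0-1 fraction of screen width" },
          y: { type: "number", description: "0-1 fraction of screen height" },
          reason: { type: "string", description: "Why this element is the target" },
        },
        required: ["x", "y"],
      },
    },
  },
];

// An agent harness would send a screenshot plus an instruction alongside this
// tools array, read the tool_call the model returns, execute the tap on the
// device, then feed a fresh screenshot back in for the next step.
console.log(JSON.stringify(tools, null, 2));
```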
The rollout was broad on day one. Vercel exposed it through AI Gateway under zai/glm-5v-turbo (Vercel changelog); Arena put it into Vision Arena for head-to-head visual prompt testing (Arena post); and OpenRouter listed it with concrete commercial specs (OpenRouter model page).
OpenRouter's listing gives the most useful operational numbers: a 202,752-token context, about 131K max output tokens, $1.20 per million input tokens, and $4 per million output tokens. A later BridgeBench repost also claimed 221.2 tokens per second on SpeedBench (BridgeBench repost), which at least suggests Z.ai is optimizing for throughput as hard as it is optimizing for modality.
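Those two per-token prices make request costs easy to ballpark. The helper below plugs in the listed rates; the token counts in the example are made-up illustrative values, not measurements.

```ts
// Back-of-envelope cost at OpenRouter's listed rates:
// $1.20 per 1M input tokens, $4.00 per 1M output tokens.
const INPUT_USD_PER_M = 1.2;
const OUTPUT_USD_PER_M = 4.0;

function requestCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * INPUT_USD_PER_M + (outputTokens / 1e6) * OUTPUT_USD_PER_M;
}

// Illustrative only: a screenshot-plus-prompt request of ~4,000 input tokens
// returning a ~2,500-token component works out to about a cent and a half.
console.log(requestCostUSD(4_000, 2_500).toFixed(4)); // "0.0148"
```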
Z.ai also appears to be threading the model directly into its own chat product. TestingCatalog's screenshot shows GLM-5V-Turbo already selectable inside z.ai alongside GLM-5 and GLM-5-Turbo, with "Visual Recognition" presented as a first-party task mode (z.ai product screenshot).
Zai has released GLM-5V-Turbo, a model designed to combine visual understanding with code generation. It natively processes images, videos, design drafts, and document layouts, and can generate runnable code from screenshots and web interfaces. According to Zai, the model leads …
Introducing GLM-5V-Turbo: Vision Coding Model - Native Multimodal Coding: Natively understands multimodal inputs including images, videos, design drafts, and document layouts. - Balanced Visual and Programming Capabilities: Achieves leading performance across core benchmarks for
Regarding pure-text coding, GLM-5V-Turbo maintains stable performance across three core benchmarks of CC-Bench-V2 (Backend, Frontend, and Repo Exploration), proving that the introduction of visual capabilities does not degrade text-based reasoning.
The leading performance of GLM-5V-Turbo stems from systematic upgrades across four levels: Native Multimodal Fusion: Deep fusion of text and vision begins at pre-training, with multimodal collaborative optimization during post-training. We developed the next-generation CogViT …
GLM 5V Turbo is now on AI Gateway. A vision-first coding model that sees your designs, screenshots, and GUIs to write and debug code. Use model: 'zai/glm-5v-turbo' to get started. vercel.com/changelog/glm-…
GLM-5V-Turbo is now live in Vision Arena. Test its ability to reason over visual inputs using your real-world prompts. Don't forget to vote so we can see how it stacks up.