GLM-5.2 ranks #1 on Vals and Design Arena, AA Coding Index hits 50.7
Fresh third-party results put GLM-5.2 atop multiple open-model leaderboards, including the AA Coding Index, Vals Index, Terminal Bench 2.1, and Design Arena. The scores add independent confirmation, though demand spiked enough to strain some providers.

TL;DR
- Artificial Analysis' launch breakdown put GLM-5.2 at 51 on the Intelligence Index v4.1, 11 points above GLM-5.1, while bridgemindai's coding-index note tracked a separate jump from 43.4 to 50.7 on the Artificial Analysis Coding Index.
- ValsAI's benchmark summary said GLM-5.2 is now the top open-weight model on the Vals Index, Vibe Code Bench, and Terminal Bench 2.1, and ValsAI's follow-up pegged those gains at +13% on Vals, +31% on Vibe Code Bench, and +11% on Terminal Bench 2.1 versus GLM-5.1.
- WesRoth's Design Arena post said GLM-5.2 reached #1 on Design Arena at 1360 Elo, up 27 points, while WesRoth's Code Arena recap placed it #2 on the frontend leaderboard at 1595, 29 points ahead of Claude Opus 4.7 Thinking.
- The release is unusually broad for an open-weight model: Artificial Analysis listed MIT-licensed weights, 744B total parameters with 40B active, 1M context, and flat API pricing versus GLM-5.1, while SGLang's day-0 post and Ollama's integration thread showed it landing in tools immediately.
- Demand hit providers fast. OpenCode's capacity note said traffic jumped 3x, Ollama's reply said capacity was being added after stronger-than-expected demand, and dingyi's hands-on report described a full day of ZCode use with low latency and little rework.
You can open the GLM Coding Plan page, browse SGLang's cookbook, inspect the Agent Arena methodology, and dig through the Code Arena leaderboard. Simon Willison's writeup is also useful because it calls out the least flattering number in the launch wave: GLM-5.2 appears to spend far more output tokens per task than the previous release.
Artificial Analysis
The cleanest third-party summary came from Artificial Analysis. Its v4.1 index moved to longer and harder agentic workloads, replacing older components with Terminal-Bench 2.1, τ³-Bench Banking, and GDPval-AA v2, as WesRoth's index-change summary noted.
Within that updated mix, Artificial Analysis gave GLM-5.2 three numbers that matter most:
- Intelligence Index: 51, up from 40 for GLM-5.1, a gain of 11 points.
- GDPval-AA v2: 1524, ahead of MiniMax-M3 at 1418 and DeepSeek V4 Pro at 1328.
- Terminal Bench 2.1: 78%, up 16 points from GLM-5.1.
The weirdest benchmark in the batch is CritPt, a private set of research-level physics problems. Artificial Analysis' CritPt post said GLM-5.2 hit 20.9%, tied Claude Opus 4.8, beat several proprietary models, and marked a 4.5x jump from GLM-5.1's 4.6%.
The cost story is less clean. Artificial Analysis said list pricing stayed flat at $1.4 per 1M input tokens, $0.26 per 1M cache-hit tokens, and $4.4 per 1M output tokens, but the same post also said GLM-5.2 used 43K output tokens per task, up from 26K on GLM-5.1. Simon Willison's weblog post highlighted the same token appetite.
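To make the flat-price, higher-usage point concrete, here is a minimal arithmetic sketch in Python. It uses only the output price and per-task token counts quoted above, and it assumes those counts are averages and that input and cache-hit costs are ignored; the function and variable names are illustrative, not from Artificial Analysis.

```python
# Output-side cost per task at the quoted list price.
# Assumption: 26K and 43K are average output tokens per task, and
# input / cache-hit tokens are left out of this estimate.
OUTPUT_PRICE_PER_MILLION_USD = 4.4  # same list price for GLM-5.1 and GLM-5.2


def output_cost_usd(output_tokens: int,
                    price_per_million: float = OUTPUT_PRICE_PER_MILLION_USD) -> float:
    """Return the output-token cost in USD for a single task."""
    return output_tokens * price_per_million / 1_000_000


glm_51_cost = output_cost_usd(26_000)
glm_52_cost = output_cost_usd(43_000)
relative_increase = glm_52_cost / glm_51_cost - 1

print(f"GLM-5.1 output cost per task: ${glm_51_cost:.4f}")
print(f"GLM-5.2 output cost per task: ${glm_52_cost:.4f}")
print(f"Relative increase: {relative_increase:.0%}")
```

Under these assumptions, the output-side bill per task rises by roughly two thirds even though the price per token did not change.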
Vals stack
ValsAI added a second independent read on the coding story. ValsAI's benchmark summary placed GLM-5.2 at the top of the open-weight pack on the Vals Index, Vibe Code Bench, and Terminal Bench 2.1, with an overall rank of #5 across all models.
ValsAI's follow-up broke the deltas out more explicitly:
- Vals Index: +13% versus GLM-5.1.
- Vibe Code Bench: +31% versus GLM-5.1.
- Terminal Bench 2.1: +11% versus GLM-5.1, enough to take #1 from Kimi K2.7 Code.
The eval setup matters here. ValsAI's eval-settings note said the tested model was text-only, used Z AI's API, ran with maximum reasoning effort and a 1M context window, and allowed up to 131K output tokens. The same post says Terminal Bench 2.1 was run with the benchmark's default timeout settings, while Z.ai's own blog used a four-hour timeout across tasks.
That is the useful pattern across this launch wave: it is not just one leaderboard saying GLM-5.2 is good, but several unrelated coding-oriented boards saying roughly the same thing.
Design Arena
Design and frontend are where the release starts to look stranger than a normal text-only model bump. WesRoth's Design Arena post said GLM-5.2 reached first place on Design Arena with a 1360 Elo, gaining 27 points and moving ahead of the unavailable Claude Fable 5.
Arena's own frontend leaderboard put it one rung lower overall, but still in rare company. According to Arena's frontend leaderboard thread, GLM-5.2 Max:
- Ranked #2 in Code Arena: Frontend.
- Scored 1595, which was 29 points ahead of Claude Opus 4.7 Thinking.
- Ranked #2 for React and #4 for HTML.
- Led or nearly led categories including brand and marketing, reference-based design, data and analytics, consumer products, gaming, and simulations.
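For a sense of what a 29-point gap means, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the classic logistic Elo formula with a 400-point scale; the leaderboard's actual rating model may differ (for example, a Bradley-Terry fit), so treat the result as a rough reading. The function name is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


glm_52_rating = 1595
gap = 29  # reported lead over Claude Opus 4.7 Thinking
opus_rating = glm_52_rating - gap  # implied by the gap, not quoted directly

preference_rate = elo_expected_score(glm_52_rating, opus_rating)
print(f"Expected preference rate for GLM-5.2: {preference_rate:.1%}")
```

Under that assumption, a 29-point lead corresponds to GLM-5.2 being preferred in about 54% of direct comparisons, a real but narrow edge.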
WesRoth's frontend recap repeated the same 1595 score and the 29-point gap over Opus 4.7 Thinking. Simon Willison's weblog post flagged the same surprise from another angle: GLM-5.2 is text-only, yet still climbed to the top tier on a web development leaderboard that often rewards agentic frontend work.
Under the hood
The model details that kept showing up across posts are concrete enough to separate from the benchmark chatter.
- Size: Artificial Analysis listed 744B total parameters and 40B active parameters, unchanged from GLM-5.1.
- Context: Artificial Analysis and SGLang's day-0 support post both listed a 1M token context window, up from 200K on GLM-5.1.
- Effort modes: SGLang's day-0 support post described High and Max reasoning settings to trade latency for depth.
- Long-context architecture: SGLang's day-0 support post said IndexShare reduces per-token FLOPs by 2.9x at 1M context, while Together AI's launch thread repeated the same claim for tool-heavy, long-context workloads.
- Decoding: SGLang's day-0 support post said improved MTP lifted speculative-decoding acceptance by up to 20%.
- Training: Cedric Chee's RL thread said GLM-5.2 used critic-based PPO instead of GRPO for long coding trajectories, split super-long traces into sub-traces for training, and added anti-hacking safeguards.
The most specific training anecdote came from Cedric Chee's reward-hacking post, which said Z.ai ran into reward hacking during RL and responded with an LLM judge that scores the intent behind tool calls instead of just rejecting an action and terminating the trajectory.
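The posts do not say how super-long traces were split into sub-traces. As a purely illustrative sketch, the Python below shows one generic way to do it: greedily grouping consecutive steps under a token budget. The function name, the budget, and the greedy rule are all assumptions, not Z.ai's method.

```python
from typing import List, Sequence


def split_into_subtraces(step_token_counts: Sequence[int],
                         max_tokens: int) -> List[List[int]]:
    """Group consecutive step indices so each group fits within max_tokens.

    Hypothetical helper: a step that alone exceeds max_tokens is kept
    as its own single-step group rather than being cut in the middle.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    subtraces: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0

    for index, tokens in enumerate(step_token_counts):
        # Close the current group if adding this step would exceed the budget.
        if current and current_tokens + tokens > max_tokens:
            subtraces.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += tokens

    if current:
        subtraces.append(current)
    return subtraces


# Five steps with made-up token counts and a 1,000-token budget.
example = split_into_subtraces([400, 300, 500, 200, 900], max_tokens=1000)
print(example)  # [[0, 1], [2, 3], [4]]
```

Keeping steps whole means each sub-trace still contains complete tool calls, which is the kind of unit a per-action judge would need to score.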
Where it shows up
This model was not confined to Z.ai's own surface for long. The ecosystem rollout was immediate and unusually broad for a 744B-class open-weight release.
- Ollama's GLM-5.2 thread announced hosted support on Blackwell GPUs, with launch commands for Claude Code, Codex App, Hermes Agent, and chat.
- Ollama's Codex post separately called out Codex support via `ollama launch codex` and `ollama launch codex-app`.
- SGLang's day-0 support post shipped day-0 inference support plus a cookbook.
- Cline's availability post said GLM-5.2 was available in Cline and called it the first open-weights model above 80% on Terminal-Bench.
- Together AI's launch thread put GLM-5.2 on Together AI with the same 1M context, flexible effort levels, and MIT-licensed weights.
- Artificial Analysis listed third-party availability across DeepInfra, Novita, Nebius, Parasail, SiliconFlow, GMI Cloud, Baseten, and Fireworks.
There was also a first-party product angle. WesRoth's launch note said GLM-5.2 first landed for GLM Coding Plan users across Lite, Pro, Max, and Team, with API and chatbot access following later. On the client side, HCSolakoglu's Linux install post surfaced direct Linux builds of ZCode that were already live on Zhipu AI's CDN.
Capacity crunch
The fastest confirmation that something had landed was not another benchmark card; it was providers running hot. OpenCode's capacity note said GLM-5.2 demand was running at 3x normal levels. Ollama's reply said the service was adding capacity after stronger-than-anticipated demand, and another Ollama reply said more capacity was still being brought online.
Hands-on reports were positive enough to explain the rush. haider1's long-horizon test said GLM-5.2 held context through a 12-step refactor and had more reliable tool calling than the author was used to. dingyi's ZCode hands-on report described a day of refactor work with little lag, little rework, and five hours of remaining quota after a completed task.
Not every anecdote was clean. emollick's shader test follow-up said one run produced blank screens and another produced errors, even though the model also showed strong results on visual coding prompts in the same thread. That split fits the broader picture here: the leaderboard story is already strong, but the real test has moved to how often these long agentic runs stay on the rails when lots of people hit them at once.