AI Primer

GLM-5.2 ranks #1 on Vals and Design Arena, AA Coding Index hits 50.7

Fresh third-party results put GLM-5.2 atop multiple open-model leaderboards, including the AA Coding Index, Vals Index, Terminal Bench 2.1, and Design Arena. The scores add independent confirmation, though demand spiked enough to strain some providers.


TL;DR

You can open the GLM Coding Plan page, browse SGLang's cookbook, inspect the Agent Arena methodology, and dig through the Code Arena leaderboard. Simon Willison's writeup is also useful because it calls out the least flattering number in the launch wave: GLM-5.2 appears to spend far more output tokens per task than the previous release.

Artificial Analysis

The cleanest third-party summary came from Artificial Analysis. Its v4.1 index moved to longer and harder agentic workloads, replacing older components with Terminal-Bench 2.1, τ³-Bench Banking, and GDPval-AA v2, as WesRoth's index-change summary noted.

Within that updated mix, Artificial Analysis gave GLM-5.2 three numbers that matter most:

  • Intelligence Index: 51, up from 40 for GLM-5.1, a gain of 11 points.
  • GDPval-AA v2: 1524, ahead of MiniMax-M3 at 1418 and DeepSeek V4 Pro at 1328.
  • Terminal-Bench 2.1: 78%, up 16 points from GLM-5.1.

The weirdest benchmark in the batch is CritPt, a private set of research-level physics problems. Artificial Analysis' CritPt post said GLM-5.2 hit 20.9%, tied Claude Opus 4.8, beat several proprietary models, and marked a 4.5x jump from GLM-5.1's 4.6%.

The cost story is less clean. Artificial Analysis said list pricing stayed flat at $1.4 per 1M input tokens, $0.26 per 1M cache-hit tokens, and $4.4 per 1M output tokens, but the same post also said GLM-5.2 used 43K output tokens per task, up from 26K on GLM-5.1. Simon Willison's weblog post highlighted the same token appetite.
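To make the token appetite concrete, here is a rough back-of-the-envelope sketch using only the figures quoted above. It assumes the $4.4 per 1M output-token list price applies to both models (the post said pricing stayed flat) and it ignores input and cache-hit tokens entirely, so it is an illustration of the trend, not a real bill.

```python
# Figures quoted above from Artificial Analysis; everything else is illustrative.
OUTPUT_PRICE_PER_M = 4.4  # USD per 1M output tokens (list price, assumed equal for both models)
TOKENS_PER_TASK = {"GLM-5.1": 26_000, "GLM-5.2": 43_000}


def output_cost_per_task(tokens: int, price_per_million: float = OUTPUT_PRICE_PER_M) -> float:
    """Output-token spend for one task at list price, in USD (input tokens ignored)."""
    return tokens / 1_000_000 * price_per_million


costs = {model: output_cost_per_task(tokens) for model, tokens in TOKENS_PER_TASK.items()}
ratio = costs["GLM-5.2"] / costs["GLM-5.1"]

for model, cost in costs.items():
    print(f"{model}: ${cost:.4f} per task (output tokens only)")
print(f"Increase: {ratio:.2f}x")
```

Under those assumptions, output spend per task rises from about $0.11 to about $0.19, roughly 1.65x, even though the per-token price did not change.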

Vals stack

ValsAI added a second independent read on the coding story. ValsAI's benchmark summary placed GLM-5.2 at the top of the open-weight pack on the Vals Index, Vibe Code Bench, and Terminal Bench 2.1, with an overall rank of #5 across all models.

ValsAI's follow-up broke the deltas out more explicitly:

  • Vals Index: +13% versus GLM-5.1.
  • Vibe Code Bench: +31% versus GLM-5.1.
  • Terminal Bench 2.1: +11% versus GLM-5.1, enough to take #1 from Kimi K2.7 Code.

The eval setup matters here. ValsAI's eval-settings note said the tested model was text-only, was accessed through Z.ai's API, ran with maximum reasoning effort and a 1M context window, and allowed up to 131K output tokens. The same note said Terminal Bench 2.1 was run with the benchmark's default timeout settings, whereas Z.ai's own blog used a four-hour timeout across tasks, so the two sets of Terminal Bench figures should not be compared directly.

That is the useful pattern across this launch wave: it is not just one leaderboard saying GLM-5.2 is good; several unrelated coding-oriented boards say roughly the same thing.

Design Arena

Design and frontend are where the release starts to look stranger than a normal text-only model bump. WesRoth's Design Arena post said GLM-5.2 reached first place on Design Arena with a 1360 Elo, gaining 27 points and moving ahead of the unavailable Claude Fable 5.

Arena's own frontend leaderboard put it one rung lower overall, but still in rare company. According to Arena's frontend leaderboard thread, GLM-5.2 Max:

  • Ranked #2 in Code Arena: Frontend.
  • Scored 1595, which was 29 points ahead of Claude Opus 4.7 Thinking.
  • Ranked #2 for React and #4 for HTML.
  • Led or nearly led categories including brand and marketing, reference-based design, data and analytics, consumer products, gaming, and simulations.

WesRoth's frontend recap repeated the same 1595 score and the 29-point gap over Opus 4.7 Thinking. Simon Willison's weblog post flagged the same surprise from another angle: GLM-5.2 is text-only, yet still climbed to the top tier on a web development leaderboard that often rewards agentic frontend work.
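For a sense of scale, a 29-point gap can be turned into an expected head-to-head win rate. The sketch below assumes the leaderboard score behaves like a standard Elo rating (logistic curve, base 10, scale 400); the posts cited here do not spell out Arena's exact scoring model, so treat the result as an approximation.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-rated model under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / scale))


gap = 29  # points ahead of Claude Opus 4.7 Thinking, as quoted above
p = expected_win_rate(gap)
print(f"Expected win rate at a {gap}-point gap: {p:.1%}")
```

Under that assumption, a 29-point lead corresponds to winning roughly 54% of pairwise matchups: a real edge, but a narrow one.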

Under the hood

The model details that kept showing up across posts are concrete enough to separate from the benchmark chatter.

The most specific training anecdote came from Cedric Chee's post on reward hacking, which said Z.ai had to deal with reward hacking during RL and used an LLM judge that scores the intent behind tool calls instead of just rejecting an action and terminating the trajectory.

Where it shows up

This model was not confined to Z.ai's own surface for long. The ecosystem rollout was immediate and unusually broad for a 744B-class open-weight release.

  • Ollama's GLM-5.2 thread announced hosted support on Blackwell GPUs, with launch commands for Claude Code, Codex App, Hermes Agent, and chat.
  • Ollama's Codex post separately called out Codex support via ollama launch codex and ollama launch codex-app.
  • SGLang's day-0 support post announced inference support at launch, plus a cookbook.
  • Cline's availability post said GLM-5.2 was available in Cline and called it the first open-weights model above 80% on Terminal-Bench.
  • Together AI's launch thread put GLM-5.2 on Together AI with the same 1M context, flexible effort levels, and MIT-licensed weights.
  • Artificial Analysis listed third-party availability across DeepInfra, Novita, Nebius, Parasail, SiliconFlow, GMI Cloud, Baseten, and Fireworks.

There was also a first-party product angle. WesRoth's launch note said GLM-5.2 first landed for GLM Coding Plan users across Lite, Pro, Max, and Team, with API and chatbot access following later. On the client side, HCSolakoglu's Linux install post surfaced direct Linux builds of ZCode that were already live on Zhipu AI's CDN.

Capacity crunch

The fastest confirmation that something had landed was not another benchmark card; it was providers running hot. OpenCode's capacity note said GLM-5.2 demand was running at 3x normal levels. Ollama's reply said the service was adding capacity after stronger-than-anticipated demand, and another Ollama reply said more capacity was still being brought online.

Hands-on reports were positive enough to explain the rush. haider1's long-horizon test said GLM-5.2 held context through a 12-step refactor and had more reliable tool calling than the author was used to. dingyi's ZCode hands-on report described a day of refactor work with little lag, little rework, and five hours of remaining quota after a completed task.

Not every anecdote was clean. emollick's shader test follow-up said one run produced blank screens and another produced errors, even though the model also showed strong results on visual coding prompts in the same thread. That split fits the broader picture here: the leaderboard story is already strong, but the real test has moved to how often these long agentic runs stay on the rails when lots of people hit them at once.

Further reading

GLM-5.2 is probably the most powerful text-only open weights LLM

Chinese AI lab Z.ai released GLM-5.2 to their coding plan subscribers on June 13th, and then yesterday (June 16th) released the full open weights under an MIT license. Similar in size to their previous GLM-5 and GLM-5.1 releases, this is 753B parameter, 1.51TB monster - with 40 active parameters (Mixture of Experts).

GLM-5.2 is a text input only model - Z.ai have a separate vision family most recently represented by GLM-5V-Turbo, but that one isn't open weights. GLM-5.2 has a 1 million token context window, up from GLM-5.1's 200,000.

The buzz around this model is strong. Artificial Analysis, who run one of the most widely respected independent benchmarks:

GLM-5.2 is the new leading open weights model on the Artificial Analysis Intelligence Index. GLM-5.2 is the leading open weights model on the Intelligence Index v4.1. At 51, it leads MiniMax-M3 (44), DeepSeek V4 Pro (max, 44) and Kimi K2.6 (43)

They did however find it to be quite token-hungry:

GLM-5.2 uses more output tokens per task than other leading open weights models: the model uses 43k output tokens per Intelligence Index task, up from GLM-5.1 (26k) and above MiniMax-M3 (24k), Kimi K2.6 (35k) and DeepSeek V4 Pro (max, 37k)

The model is also now ranked 2nd on the Code Arena WebDev leaderboard, behind only Claude Fable 5. That leaderboard measures "front-end web development tasks, including agentic coding workflows". I'm impressed to see it rank so highly given the lack of image input, which I had incorrectly assumed wa
