AI Primer

Engineers report GLM-5.2 matches near-Opus planning at about 1/10 the price

Independent tests put GLM-5.2 near Opus 4.8 and GPT-5.5 on planning and coding, and users shared Claude Code, BrowserCode, dcode, and local-serving recipes. It matters because many engineers are treating it as a daily-driver option for text-heavy coding, though teams still report weaker vision and provider limits.

Fast paths

You can skim Kilo Code's full breakdown, grab a local quantization guide, or use a vLLM serving guide plus a GLM serving recipe. For architecture detail, Sebastian Raschka's earlier GLM write-up is the cleanest pointer in the evidence set, and for benchmark context Artificial Analysis' results page plus its launch article show where GLM still sits below the very top closed models.

Planning quality

The cleanest comparison in the evidence is still the narrow one. kilocode's GLM versus Claude Fable 5 run used the same prompt, task, and rubric, and scored GLM at 9.0 against Fable at 9.1 on a feature-flag service plan.

What mattered in that task was not syntax. According to kilocode on the planning benchmark, the trap was deterministic rollout math, and kilocode's judgment breakdown said the models mostly agreed on the hard calls: keep environment data out of the rollout hash, use fast SHA-256 for API keys, cache unknown-flag lookups. Fable pulled ahead because, as kilocode's judgment breakdown put it, it made one create-time cache hazard explicit that GLM left implicit.

That same thread also checked build execution after planning. kilocode's build follow-up said both models built from Fable's winning plan, both picked the same 77 users for a 35 percent rollout, and GLM passed 15 of 15 live checks while Fable passed 14.
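The "deterministic rollout math" in that task can be illustrated with a minimal sketch. This is not kilocode's benchmark code or either model's output; the function names, the bucket resolution, and the `flag:user` hashing scheme are assumptions chosen for illustration. It shows two of the agreed calls: the rollout hash depends only on the flag key and user ID (no environment data), and API keys are stored as plain SHA-256 digests.

```python
import hashlib

BUCKETS = 10_000  # hypothetical resolution: steps of 0.01 percent


def rollout_bucket(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in [0, BUCKETS)."""
    # The environment name is deliberately NOT part of the hashed material,
    # so a user lands in the same bucket in staging and production.
    material = f"{flag_key}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def is_enabled(flag_key: str, user_id: str, percent: float) -> bool:
    """Return True if the user falls inside the rollout percentage."""
    threshold = round(percent * BUCKETS / 100)
    return rollout_bucket(flag_key, user_id) < threshold


def hash_api_key(api_key: str) -> str:
    """Store a fast SHA-256 digest of a high-entropy API key."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()
```

Because the bucket is a pure function of the flag key and user ID, two independent implementations of the same scheme select the same users, and raising the percentage only ever adds users to the enabled set.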

Benchmark spread

Across public evaluator posts, the headline is less "GLM beat everything" than "GLM cleared the open-weight tier by a lot." ValsAI put GLM-5.2 at 64 percent on Vibe Code Bench v1.1, 14 points ahead of the next open-weight model, and its generational chart showed that was up from 31.5 percent for GLM-5.1 and 3.1 percent for GLM-4.6.

The list of evaluators broadened quickly:

  • ValsAI's follow-up said GLM-5.2 became the top open-weight model on the Vals Index, Vibe Code Bench, and Terminal Bench 2.1.
  • ValsAI's broader index post said it also ranked first on Harvey's Legal Agent Benchmark, Finance Agent v2, and ProofBench inside Vals' in-house suite.
  • Artificial Analysis on CritPt said GLM-5.2 scored 20.9 percent on CritPt, matching Claude Opus 4.8 and far above the next open model at 12.9 percent.
  • Agent Arena's leaderboard post put GLM-5.2 Max at #10 overall and #1 among open models on its long-horizon agent leaderboard, while noting a steerability tradeoff versus GLM-5.1.
  • WesRoth on Design Arena said GLM-5.2 reached first place on Design Arena with 1360 Elo.

The ceiling still shows up on harder knowledge-work evaluations. Artificial Analysis on AA-Briefcase ranked GLM-5.2 Max third behind Claude Fable 5 and Claude Opus 4.8, at 1266 Elo versus Opus 4.8 at 1356 and Fable 5 at 1587, with GLM's average task cost at $2.40.
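To put those Elo gaps in perspective, here is a small worked calculation. It assumes AA-Briefcase ratings follow the conventional logistic Elo scale with a 400-point divisor, which the source does not confirm, so the numbers are a rough reading.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B on the conventional 400-point Elo scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the article.
GLM_52_MAX = 1266
OPUS_48 = 1356
FABLE_5 = 1587

glm_vs_opus = elo_expected_score(GLM_52_MAX, OPUS_48)
glm_vs_fable = elo_expected_score(GLM_52_MAX, FABLE_5)
```

Under that assumption, the 90-point gap to Opus 4.8 corresponds to GLM being preferred in roughly 37 percent of head-to-head comparisons, and the 321-point gap to Fable 5 to roughly 14 percent.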

Harnesses

The most useful part of this story for working engineers is how little ceremony people needed to try it. multimodalart's Claude Code recipe showed the minimal path through Hugging Face Inference Providers, and aibuilderclub_'s Claude Code env vars posted the fuller environment-variable setup for swapping GLM into Claude Code.
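The exact variables from those posts are not reproduced here, so the following is only a sketch of the general shape of such a swap. It assumes the setup relies on Claude Code's `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, and `ANTHROPIC_MODEL` environment variables. The endpoint URL and model identifier in the usage comment are placeholders, not values from the source.

```python
import os
from typing import Mapping, Optional


def build_claude_code_env(
    base_url: str,
    auth_token: str,
    model: str,
    base_env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of the environment that points Claude Code at another provider."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = base_url      # Anthropic-compatible endpoint
    env["ANTHROPIC_AUTH_TOKEN"] = auth_token  # provider API key
    env["ANTHROPIC_MODEL"] = model            # model name the provider expects
    return env


# Hypothetical usage (placeholder URL and model id; launches the CLI):
#   import subprocess
#   env = build_claude_code_env("https://example.invalid/anthropic", "sk-...", "glm-5.2")
#   subprocess.run(["claude"], env=env)
```

Building the environment as a copy keeps the override scoped to one launched process instead of changing the shell's global configuration.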

The rest of the harness map filled in quickly, with users also sharing BrowserCode and dcode recipes.

That harness portability is part of the appeal in the reactions. MaximeRivest on control framed the model as something you can host yourself if a provider turns sour, and matvelloso after a full day called it the first open model that cleared his daily-driver bar.

Gaps

The field reports were enthusiastic, but they were not uniform. jeremyphoward's vision note called image handling the one big gap, and mattpocockuk on vision separately described vision as disappointing.

Other limits showed up as workflow friction, notably provider limits.

There is also a small but telling weirdness tax when GLM is dropped into Anthropic-shaped tooling. peakcooper's Pi harness post showed GLM insisting it was Claude until it checked local agent config.

Under the hood

The evidence pool does surface a few concrete reasons the model behaves differently. rasbt on GLM-5.2 architecture said GLM-5.2 keeps GLM-5's MLA and DSA setup and adds IndexShare, which reuses sparse-attention top-k indices across four layers to make 1M-token inference cheaper.
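The index-reuse idea can be shown with a toy sketch. This is not Z.ai's implementation. The `indexer` callable, the grouping rule (the first layer of each group of four computes indices, the next three reuse them), and all names are assumptions used only to show why sharing top-k indices cuts indexer work.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

GROUP_SIZE = 4  # the article says indices are reused across four layers


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Positions of the k highest-scoring tokens, in ascending position order."""
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


def select_tokens_per_layer(
    indexer: Callable[[int], Sequence[float]],
    num_layers: int,
    k: int,
    group_size: int = GROUP_SIZE,
) -> Tuple[List[List[int]], int]:
    """Return the token selection for every layer plus the number of indexer runs."""
    selections: List[List[int]] = []
    indexer_calls = 0
    shared: List[int] = []
    for layer in range(num_layers):
        if layer % group_size == 0:
            # Only the first layer of each group scores tokens and picks top-k.
            shared = top_k_indices(indexer(layer), k)
            indexer_calls += 1
        # The remaining layers in the group attend over the same indices.
        selections.append(shared)
    return selections, indexer_calls
```

In this toy version an 8-layer stack runs the indexer twice instead of eight times, which is the kind of saving that matters more as the context approaches 1M tokens.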

On the training side, Cedric Chee's notes extracted the main post-training mechanics from Z.ai's release material.

Those details line up with how users described the model. haider1 on long multi-step work praised context retention across a 12-step refactor, while cedric_chee on a voxel pagoda run said GLM handled planning, implementation, verification, and work management on a long-running build.

Local and self-hosted runs

The last useful reveal is how aggressively the ecosystem moved to make a 753B-class open model practical outside the first-party API. UnslothAI's local quantization post said its 2-bit version shrank GLM-5.2 from 1.51 TB to 238 GB while retaining about 82 percent accuracy, small enough to fit on a 256 GB Mac or a comparable combined RAM and VRAM setup.
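Those sizes can be sanity-checked with simple arithmetic. The sketch below assumes exactly 753 billion parameters and decimal units (1 TB = 10^12 bytes). Both are approximations, since the article only says "753B-class".

```python
PARAMS = 753e9            # assumed parameter count ("753B-class")
ORIGINAL_BYTES = 1.51e12  # 1.51 TB, as quoted
QUANTIZED_BYTES = 238e9   # 238 GB, as quoted
MACHINE_BYTES = 256e9     # 256 GB of unified memory or RAM plus VRAM


def bits_per_weight(total_bytes: float, params: float = PARAMS) -> float:
    """Average storage cost per parameter, in bits."""
    return total_bytes * 8 / params


original_bpw = bits_per_weight(ORIGINAL_BYTES)        # about 16 bits
quantized_bpw = bits_per_weight(QUANTIZED_BYTES)      # about 2.5 bits
compression_ratio = ORIGINAL_BYTES / QUANTIZED_BYTES  # about 6.3x
headroom_gb = (MACHINE_BYTES - QUANTIZED_BYTES) / 1e9  # about 18 GB left over
```

Under these assumptions the original checkpoint works out to about 16 bits per weight, consistent with 16-bit storage, while the "2-bit" release averages about 2.5 bits per weight. That would be consistent with some tensors being kept at higher precision, though the source does not say so. The roughly 18 GB of headroom on a 256 GB machine has to cover the KV cache and everything else.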

Serving recipes also got specific. gneubig on SGLang cookbooks pointed to hardware-specific SGLang guides, and vllm_project on self-hosting said vLLM can serve GLM-5.2 behind an OpenAI Responses-compatible API so Codex-style agents can target self-hosted inference as a drop-in endpoint.
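As a rough sketch of what "drop-in endpoint" means for a client, the code below builds a Responses-style request against a self-hosted server using only the standard library. The base URL, port, model identifier, and placeholder API key are assumptions. The payload uses the `model` and `input` fields of the OpenAI Responses API, and the path assumes the server exposes `/v1/responses`.

```python
import json
import urllib.request


def build_responses_request(
    base_url: str, model: str, prompt: str, api_key: str = "EMPTY"
) -> urllib.request.Request:
    """Build a POST request for a Responses-compatible endpoint."""
    payload = {"model": model, "input": prompt}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Hypothetical usage against a local server:
#   req = build_responses_request("http://localhost:8000", "glm-5.2", "Plan a refactor.")
#   print(send(req))
```

Only the base URL and model name change between a hosted provider and self-hosted inference, which is what lets Codex-style agents switch targets without code changes.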

That is why so many reactions focused on control as much as raw benchmark rank. MaximeRivest on control liked that he could host it himself, and MaximeRivest on owned hardware argued that the bigger shift is not just a near-frontier local model, but what a small team can build once a model at this level sits on hardware they own.
