Update: June 19, 2026

Engineers report GLM-5.2 matches near-Opus planning at about 1/10 the price

Independent tests put GLM-5.2 near Opus 4.8 and GPT-5.5 on planning and coding, and users shared Claude Code, BrowserCode, dcode, and local-serving recipes. It matters because many engineers are treating it as a daily-driver option for text-heavy coding, though teams still report weaker vision and provider limits.

TL;DR

kilocode's GLM versus Claude Fable 5 run scored GLM-5.2 at 9.0 and Fable 5 at 9.1 on the same planning rubric, while kilocode's price comparison put GLM at $1.40 input and $4.40 output per million tokens versus Fable's $10 and $50.
On broader coding evals, ValsAI's Vibe Code Bench result said GLM-5.2 was the first open-weight model above 60 percent, and ValsAI's follow-up said it also took the top open-weight spot on Terminal Bench 2.1 and the Vals Index.
Hands-on reports from jeremyphoward's first impression, matvelloso after a full day, and Yuchenj_UW after side-by-side use all described GLM-5.2 as close to current proprietary coding models for text-heavy work.
The recurring caveat is vision and polish: jeremyphoward's vision note called blindness the main gap, while bridgemindai's tradeoff post argued GLM often needs more prompting than GPT-5.5 or Opus 4.8 to land the same result.
Engineers were already wiring it into existing harnesses through multimodalart's Claude Code recipe, hwchase17 on dcode, browser_use in BrowserCode, and UnslothAI's local quantization post.
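
How close the "about 1/10 the price" framing is depends on the mix of input and output tokens. The following is a minimal sketch of that arithmetic using the per-million-token prices quoted above. The 3:1 input-to-output mix and the helper name `blended_price` are illustrative assumptions, not figures from Kilo Code.

```python
# Per-million-token prices (USD) as quoted in kilocode's price comparison.
GLM_PRICE = {"input": 1.40, "output": 4.40}
FABLE_PRICE = {"input": 10.00, "output": 50.00}


def blended_price(price, input_share):
    """Average USD per million tokens for a given share of input tokens."""
    return price["input"] * input_share + price["output"] * (1.0 - input_share)


# Assumption: 3 input tokens for every output token (input_share = 0.75).
glm = blended_price(GLM_PRICE, 0.75)
fable = blended_price(FABLE_PRICE, 0.75)
ratio = fable / glm

print(f"GLM ${glm:.2f}/M, Fable ${fable:.2f}/M, Fable is {ratio:.1f}x the price")
# prints: GLM $2.15/M, Fable $20.00/M, Fable is 9.3x the price
```

At the extremes the gap is about 7.1x on input-only pricing and about 11.4x on output-only pricing, so the one-tenth figure sits inside the plausible range rather than at a fixed point.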

Fast paths

You can skim Kilo Code's full breakdown, grab a local quantization guide, or use a vLLM serving guide plus a GLM serving recipe. For architecture detail, Sebastian Raschka's earlier GLM write-up is the cleanest pointer in the evidence set, and for benchmark context Artificial Analysis' results page plus its launch article show where GLM still sits below the very top closed models.

Planning quality

The cleanest comparison in the evidence is still the narrow one. kilocode's GLM versus Claude Fable 5 run used the same prompt, task, and rubric, and landed a 9.0 versus 9.1 split on a feature-flag service plan.

What mattered in that task was not syntax. According to kilocode on the planning benchmark, the trap was deterministic rollout math, and kilocode's judgment breakdown said the models mostly agreed on the hard calls: keep environment data out of the rollout hash, use fast SHA-256 for API keys, cache unknown-flag lookups. Fable pulled ahead because, as kilocode's judgment breakdown put it, it made one create-time cache hazard explicit that GLM left implicit.

That same thread also checked build execution after planning. kilocode's build follow-up said both models built from GLM's winning plan, both picked the same 77 users for a 35 percent rollout, and GLM passed 15 of 15 live checks while Kimi passed 14.
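
The "deterministic rollout math" at the center of that task can be sketched as below. This is an illustration under stated assumptions, not either model's actual plan: the function names, the `flag_key:user_id` hash input, and the 100-bucket scheme are hypothetical. It does show the two calls the models agreed on, keeping environment data out of the rollout hash and using plain SHA-256 for API keys.

```python
import hashlib


def rollout_bucket(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    # The environment name is deliberately NOT part of the hash input, so a
    # user lands in the same bucket in staging and in production.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_key: str, user_id: str, percent: int) -> bool:
    """A user is in the rollout when their bucket falls below the percentage."""
    return rollout_bucket(flag_key, user_id) < percent


def hash_api_key(api_key: str) -> str:
    """Fast SHA-256 digest for storing and looking up API keys."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


# Hypothetical user population and flag name, for illustration only.
users = [f"user-{i}" for i in range(200)]
cohort = [u for u in users if is_enabled("new-checkout", u, 35)]
print(len(cohort), "of", len(users), "users are in the 35 percent cohort")
```

Because the bucket depends only on the flag and the user, two independent implementations of the same scheme select the same users, which is the property the build follow-up checked. Raising the percentage only ever adds users to the cohort.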

Benchmark spread

Across public evaluator posts, the headline is less "GLM beat everything" than "GLM cleared the open-weight tier by a lot." ValsAI's Vibe Code Bench result put GLM-5.2 at 64 percent on Vibe Code Bench v1.1, 14 points ahead of the next open-weight model, and ValsAI's generational chart said that was up from 31.5 percent for GLM-5.1 and 3.1 percent for GLM 4.6.

Evaluator coverage broadened quickly:

ValsAI's follow-up said GLM-5.2 became the top open-weight model on the Vals Index, Vibe Code Bench, and Terminal Bench 2.1.
ValsAI's broader index post said it also ranked first on Harvey's Legal Agent Benchmark, Finance Agent v2, and ProofBench inside Vals' in-house suite.
Artificial Analysis on CritPt said GLM-5.2 scored 20.9 percent on CritPt, matching Claude Opus 4.8 and far above the next open model at 12.9 percent.
Agent Arena's leaderboard post put GLM-5.2 Max at #10 overall and #1 among open models on its long-horizon agent leaderboard, while noting a steerability tradeoff versus GLM-5.1.
WesRoth on Design Arena said GLM-5.2 reached first place on Design Arena with 1360 Elo.

The ceiling still shows up on harder knowledge-work evaluations. Artificial Analysis on AA-Briefcase ranked GLM-5.2 Max third behind Claude Fable 5 and Claude Opus 4.8, at 1266 Elo versus Opus 4.8 at 1356 and Fable 5 at 1587, with GLM's average task cost at $2.40.

Harnesses

The most useful part of this story for working engineers is how little ceremony people needed to try it. multimodalart's Claude Code recipe showed the minimal path through Hugging Face Inference Providers, and aibuilderclub_'s Claude Code env vars posted the fuller environment-variable setup for swapping GLM into Claude Code.
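
The environment-variable route can be sketched as follows. The variable names `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, and `ANTHROPIC_MODEL` are Claude Code's override settings as far as we know. The endpoint URL, token, and model identifier below are placeholders, not values taken from either recipe, so substitute whatever your provider documents.

```python
import os


def claude_code_env(base_url: str, auth_token: str, model: str) -> dict:
    """Return a copy of the current environment with Claude Code overrides set."""
    env = dict(os.environ)
    env.update(
        {
            "ANTHROPIC_BASE_URL": base_url,      # Anthropic-compatible endpoint
            "ANTHROPIC_AUTH_TOKEN": auth_token,  # provider API token
            "ANTHROPIC_MODEL": model,            # model id the provider expects
        }
    )
    return env


# Placeholder values (assumptions): replace with your provider's real settings.
env = claude_code_env("https://example.invalid/anthropic", "YOUR_TOKEN", "glm-5.2")

# To launch Claude Code with these overrides (not executed here):
# import subprocess
# subprocess.run(["claude"], env=env)
```

Building the environment as a copy keeps the override scoped to the launched process, so the shell's default Claude configuration is left untouched.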

The rest of the harness map filled in quickly:

hwchase17 on dcode recommended dcode as a more model-agnostic harness than Claude Code or Codex for GLM testing.
browser_use in BrowserCode said a BrowserCode task cost $0.18 and described the score as near Opus level.
opencode's availability post announced GLM-5.2 in Go with 1M context at GLM-5.1 pricing.
ollama's launch thread put it on Ollama Cloud for Claude Code, Codex App, Hermes Agent, and chat.
ollama on Codex later added direct Codex launch commands for GLM-5.2 and Kimi-K2.7-Code.
Together AI's launch note highlighted 1M context, flexible effort levels, and IndexShare on hosted inference.
FactoryAI on Droid said GLM-5.2 was already live in Droid.

That harness portability is part of the appeal in the reactions. MaximeRivest on control framed the model as something you can host yourself if a provider turns sour, and matvelloso after a full day called it the first open model that cleared his daily-driver bar.

Gaps

The field reports were enthusiastic, but they were not uniform. jeremyphoward's vision note called image handling the one big gap, and mattpocockuk on vision separately described vision as disappointing.

Other limits showed up as workflow friction:

bridgemindai's tradeoff post said GLM often takes three or four prompts to reach what GPT-5.5 or Opus 4.8 can sometimes do in one.
Agent Arena's leaderboard post reported a steerability drop versus GLM-5.1 even as overall outcomes improved.
AmpCode's status post and AmpCode's resolution post showed early availability issues during demand spikes.
ollama on capacity said Ollama had to double GPU capacity for GLM-5.2 usage, and ollama's capacity follow-up was still telling users more capacity was coming.
petergostev on Bullshit Benchmark said GLM-5.2 underperformed on Bullshit Benchmark relative to the hype cycle around coding.

There is also a small but telling weirdness tax when GLM is dropped into Anthropic-shaped tooling. peakcooper's Pi harness post showed GLM insisting it was Claude until it checked local agent config.

Under the hood

The evidence pool does surface a few concrete reasons the model behaves differently. rasbt on GLM-5.2 architecture said GLM-5.2 keeps GLM-5's MLA and DSA setup and adds IndexShare, which reuses sparse-attention top-k indices across four layers to make 1M-token inference cheaper.
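
As a toy illustration of the index-sharing idea described above, the sketch below runs a top-k selection once per group of four layers and reuses the result for the rest of the group. It is an assumption-laden simplification in pure Python, not Z.ai's implementation. The helper names and the per-layer score lists are hypothetical, and a real indexer operates on attention tensors, not on Python lists.

```python
def top_k_indices(scores, k):
    """Return the positions of the k highest scores, in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def plan_index_sharing(per_layer_scores, k, group_size=4):
    """Select top-k token indices once per layer group and reuse them within it.

    Returns the indices each layer attends to and how many times the
    (expensive) selection step actually ran.
    """
    selected = []
    indexer_calls = 0
    shared = None
    for layer, scores in enumerate(per_layer_scores):
        if layer % group_size == 0:
            # First layer of a group: run the selection and cache the result.
            shared = top_k_indices(scores, k)
            indexer_calls += 1
        # Remaining layers in the group reuse the cached indices.
        selected.append(shared)
    return selected, indexer_calls


# Eight hypothetical layers scoring six context tokens each.
layer_scores = [[0.1, 0.9, 0.3, 0.7, 0.2, 0.4]] * 4 + [[0.8, 0.1, 0.6, 0.2, 0.3, 0.5]] * 4
indices, calls = plan_index_sharing(layer_scores, k=2)
print(calls, "selection passes for", len(layer_scores), "layers")
# prints: 2 selection passes for 8 layers
```

The saving comes from the call count: selection runs for one layer in four, which matters most when the context being scored is very long.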

On the training side, Cedric Chee's notes extracted the main post-training mechanics from Z.ai's release material:

cedric_chee on agentic RL said GLM-5.2 moved from GRPO toward critic-based PPO for long coding trajectories.
cedric_chee on agentic RL said very long traces are compacted into sub-traces, all of which are reused with a token-level loss.
cedric_chee on reward hacking said Z.ai used an LLM judge to evaluate tool-call intent so reward-hacking cases did not kill the entire trajectory.
teortaxesTex on OPD training highlighted a release note claiming the final OPD training process took about two days after expert-model merging.
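
To make the GRPO-versus-critic distinction above concrete, here is a minimal sketch of the two advantage estimates in their textbook forms. It is not drawn from Z.ai's release material: generalized advantage estimation (GAE) is assumed here as the critic-based estimator because it is the common choice in PPO, and the rewards and value predictions are made-up numbers.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style: one advantage per sampled trajectory, normalized in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Critic-based: one advantage per step, built from value predictions."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # The value after the final step is treated as zero (episode ends).
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Four sampled trajectories with a pass/fail outcome reward each.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# One three-step trajectory with a reward only at the end, plus critic values.
print(gae_advantages([0.0, 0.0, 1.0], [0.2, 0.5, 0.9]))
```

The group-relative form assigns every step of a trajectory the same credit, while the critic gives each step its own estimate. That is one plausible reason to prefer a critic on long coding trajectories, though the source does not spell out Z.ai's reasoning.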

Those details line up with how users described the model. haider1 on long multi-step work praised context retention across a 12-step refactor, while cedric_chee on a voxel pagoda run said GLM handled planning, implementation, verification, and work management on a long-running build.

Local and self-hosted runs

The last useful reveal is how aggressively the ecosystem moved to make a 753B-class open model practical outside the first-party API. UnslothAI's local quantization post said its 2-bit version shrank GLM-5.2 from 1.51 TB to 238 GB while retaining about 82 percent accuracy, small enough for a 256 GB Mac or a comparable combined RAM and VRAM setup.
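
Those sizes can be sanity-checked with a little arithmetic. The sketch below assumes "753B-class" means roughly 753 billion parameters and that TB and GB are decimal units. Both are assumptions, so treat the results as approximate.

```python
PARAMS = 753e9        # assumption: ~753 billion parameters
FULL_BYTES = 1.51e12  # 1.51 TB, decimal units assumed
QUANT_BYTES = 238e9   # 238 GB, decimal units assumed


def bits_per_param(total_bytes, params):
    """Average storage cost per parameter, in bits."""
    return total_bytes * 8 / params


full_bits = bits_per_param(FULL_BYTES, PARAMS)
quant_bits = bits_per_param(QUANT_BYTES, PARAMS)
shrink = FULL_BYTES / QUANT_BYTES

print(f"{full_bits:.1f} bits/param -> {quant_bits:.2f} bits/param ({shrink:.1f}x smaller)")
# prints: 16.0 bits/param -> 2.53 bits/param (6.3x smaller)
```

Under those assumptions the full checkpoint works out to 16-bit weights, and the "2-bit" build averages a little over 2.5 bits per parameter, which would be consistent with some tensors being kept at higher precision.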

Serving recipes also got specific. gneubig on SGLang cookbooks pointed to hardware-specific SGLang guides, and vllm_project on self-hosting said vLLM can serve GLM-5.2 behind an OpenAI Responses-compatible API so Codex-style agents can target self-hosted inference as a drop-in endpoint.
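
Targeting a self-hosted endpoint then looks like any other OpenAI-style HTTP call. The sketch below only builds the request and does not send it. The `/v1/responses` path and the `model` and `input` fields follow the OpenAI Responses API shape. The local URL, the model name `glm-5.2`, and the `EMPTY` key are placeholders that depend on how the server was launched.

```python
import json
import urllib.request


def build_responses_request(base_url, model, prompt, api_key="EMPTY"):
    """Build (but do not send) a Responses-style POST request."""
    payload = {"model": model, "input": prompt}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder endpoint and model name (assumptions): adjust to your deployment.
req = build_responses_request("http://localhost:8000", "glm-5.2", "Summarize this diff.")
print(req.get_method(), req.full_url)
# prints: POST http://localhost:8000/v1/responses

# To send it against a running server (not executed here):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Because only the base URL changes, the same client code can point at a hosted provider or at hardware you own, which is the drop-in property the vLLM post emphasized.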

That is why so many reactions focused on control as much as raw benchmark rank. MaximeRivest on control liked that he could host it himself, and MaximeRivest on owned hardware argued that the bigger shift is not just a near-frontier local model, but what a small team can build once a model at this level sits on hardware they own.