Update: June 20, 2026

GLM-5.2 ranks #1 on DeepSWE with 44% pass@1

Independent results put GLM-5.2 at the top of the open-model DeepSWE board and near the top on debate and post-train evals. Watch token use and long reasoning traces, which can offset its headline price advantage.

TL;DR

datacurve's DeepSWE result put GLM-5.2 at 44% pass@1 at max effort, ahead of Kimi K2.7 Code by 17%, while scaling01's follow-up noted that the same setup can beat GPT-5.5 low and Opus 4.8 low on that board.
The cheapest headline comparison came from kilocode's planning test, where GLM-5.2 scored 9.0 versus Claude Fable 5's 9.1 on the same rubric, and kilocode's price breakdown paired that near-tie with list pricing of $1.40 in and $4.40 out versus $10 and $50.
Efficiency is the catch: Matt Pocock's harness test logged nearly 220K reasoning tokens in three turns, his follow-up said the third turn reached 560K tokens, and Theo's cost warning argued that cheaper tokens can still mean slower, heavier runs.
The broader eval picture is strong, not just one leaderboard: Lech Mazur's debate benchmark thread put GLM-5.2 Max in second behind Claude models, while ValsAI's benchmark summary said it led several of Vals' in-house agent and proof benchmarks.
Access moved fast after release. Together AI's announcement, Hyper's post, and Unsloth's local run guide all shipped support within days, but browser_use's reply and Jeremy Howard's note both flagged the same gap: GLM-5.2 is still blind to images.

You can compare the same-prompt planning shootout, inspect the DeepSWE leaderboard, and read the official GLM-5.2 release. There is also a useful architecture breakdown from Sebastian Raschka on the new IndexShare trick, plus a quantized local run guide from Unsloth that makes the deployment story more concrete.

DeepSWE

The cleanest headline is still datacurve's leaderboard post: top open model on DeepSWE, 44% pass@1 at max effort.

The board matters because it is not a toy prompt test. datacurve's update points to the live leaderboard, and scaling01's chart adds the more awkward part of the story: GLM-5.2 gets that score with much worse reasoning efficiency than the closed models clustered above it.
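For readers less familiar with the metric, pass@1 is the share of tasks solved on a single attempt. The sketch below uses the widely used unbiased pass@k estimator; it is an assumption that DeepSWE scores this way, since the leaderboard's own scoring code is not shown here, and the sample counts are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-task estimate over (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: four tasks, one attempt each, two solved.
toy_results = [(1, 1), (1, 0), (1, 1), (1, 0)]
print(benchmark_pass_at_k(toy_results, k=1))  # 0.5
```

With one attempt per task, pass@1 reduces to the plain solve rate, which is how a figure like 44% is usually read.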

The gap shows up in three ways: score, token use, and cost.

Score: GLM-5.2 Max lands at 44% pass@1, below Fable 5 at 70% and GPT-5.5 medium at 54%, per scaling01's DeepSWE bar chart.
Token use: the output-token scatter places GLM-5.2 well left of the efficient frontier, meaning more output tokens per task for less score.
Cost framing: Theo's comparison said Opus 4.8 medium and GPT-5.5 medium can both end up cheaper and smarter once you factor in volume, not just per-token list price.

Planning price compression

The most shareable GLM-5.2 result is not a benchmark chart. It is kilocode's controlled planning test, which ran the same feature-flag planning task against GLM-5.2 and Claude Fable 5 with the rubric fixed in advance.

Kilocode's breakdown split the result into three concrete pieces:

Score: GLM-5.2 got 9.0 and Fable 5 got 9.1, per kilocode's planning comparison.
Shared decisions: both plans made the same calls on rollout hashing, fast SHA-256 for API keys, and caching unknown-flag lookups, according to kilocode's detailed callouts.
Remaining gap: Fable spelled out a create-time cache trap that GLM left implicit, per the same callout thread.

That is where the pricing claim comes from. Kilocode's graphic compared Fable's $10 in and $50 out to GLM-5.2's $1.40 in and $4.40 out, then kilocode's caveat immediately narrowed the claim to what it was: one task, one run, a marker instead of a verdict.
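The trade-off between list price and token volume can be made concrete with a little arithmetic. The sketch below assumes the quoted prices are US dollars per million tokens (the unit is not stated above) and uses made-up token counts; only the four prices come from kilocode's graphic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per 1M input tokens (unit is an assumption)
    output_per_m: float  # USD per 1M output tokens (unit is an assumption)

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Dollar cost of one run at list price."""
        total = input_tokens * self.input_per_m + output_tokens * self.output_per_m
        return total / 1_000_000


GLM_5_2 = Pricing(input_per_m=1.40, output_per_m=4.40)
FABLE_5 = Pricing(input_per_m=10.00, output_per_m=50.00)

# How many times more output tokens GLM-5.2 can emit before its
# output spend matches Fable 5's at list price.
breakeven_multiplier = FABLE_5.output_per_m / GLM_5_2.output_per_m
print(round(breakeven_multiplier, 1))  # 11.4

# Hypothetical runs on the same 20K-token prompt.
fable_run = FABLE_5.cost(input_tokens=20_000, output_tokens=5_000)
glm_lean = GLM_5_2.cost(input_tokens=20_000, output_tokens=5_000)
glm_heavy = GLM_5_2.cost(input_tokens=20_000, output_tokens=100_000)
print(glm_lean < fable_run, glm_heavy > fable_run)  # True True
```

Under these assumptions the price advantage survives until GLM-5.2 emits roughly eleven times as many output tokens as the pricier model, which is the regime the token-use complaints describe.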

Bench spread

DeepSWE is not alone. The interesting part is how often GLM-5.2 shows up near the top on evals that do not look alike.

Across the evidence set, the strongest benchmark claims break down like this:

Debate: Lech Mazur's thread put GLM-5.2 Max in second place behind Claude on the LLM Debate Benchmark, which uses side-swapped multi-turn matchups and Bradley-Terry ratings.
Agentic coding: kilocode's GLM versus Kimi test scored GLM-5.2's plan at 9.0 versus Kimi K2.7 Code's 8.1 on a plan-then-build backend task.
In-house eval mix: ValsAI's summary said GLM-5.2 ranked first on the Vals Index, Harvey's Legal Agent Benchmark, Finance Agent v2, ProofBench, and Vibe Code Bench.
Post-train aggregation: a community-maintained table shows GLM-5.2 at the top, slightly ahead of Opus 4.8 in average score.
Real-world agent arena: Agent Arena's ranking moved GLM-5.2 Max to #10 overall and #1 open model, while also reporting worse steerability than GLM-5.1.
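Bradley-Terry ratings, mentioned in the debate entry above, turn pairwise win counts into one strength score per model. Here is a minimal sketch of the standard iterative fit, using invented win counts and placeholder model names. It is not the benchmark's own code, and it assumes the results from both sides of each side-swapped matchup are simply pooled.

```python
def bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths. wins[(a, b)] is how often a beat b.

    Returns strengths normalized to sum to 1; under the model,
    P(a beats b) = strength[a] / (strength[a] + strength[b]).
    """
    players = sorted({p for pair in wins for p in pair})
    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        updated = {}
        for i in players:
            total_wins = sum(w for (winner, _), w in wins.items() if winner == i)
            denom = 0.0
            for j in players:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {p: s / norm for p, s in updated.items()}
    return strength


# Invented results: each pair played 10 games, both sides pooled.
toy_wins = {
    ("model_a", "model_b"): 7, ("model_b", "model_a"): 3,
    ("model_a", "model_c"): 8, ("model_c", "model_a"): 2,
    ("model_b", "model_c"): 6, ("model_c", "model_b"): 4,
}
ratings = bradley_terry(toy_wins)
print(sorted(ratings, key=ratings.get, reverse=True))
```

The appeal of this kind of rating is that it accounts for who each model beat, not just how often it won.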

The shape of the evidence is consistent: stronger planning, coding, and long-horizon results than open-model users are used to seeing, but not a clean sweep against the best closed models.

Reasoning token burn

The most repeated criticism in hands-on reports is not quality. It is overthinking.

Three different complaints line up here:

Matt Pocock's test said GLM-5.2 produced nearly 220K thinking tokens in three turns while teaching cube solving through a tool harness.
His follow-up said the third turn alone reached 560K tokens.
Anders Lie's post argued that open models often maximize a single reasoning span because benchmark training rewards it, then described a provider-side patch that forcibly interrupts runaway traces and pushes the model toward action.
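The provider-side interruption described in the last item can be pictured as a budget on reasoning output. The sketch below is purely illustrative: the chunk format, the `cap_reasoning` helper, the whitespace token count, and the nudge text are all hypothetical, and nothing here reflects how any provider actually implements the patch.

```python
from typing import Iterable, Iterator

NUDGE = "Stop deliberating and take the next action."  # hypothetical wording


def cap_reasoning(
    chunks: Iterable[tuple[str, str]], budget: int, nudge: str = NUDGE
) -> Iterator[tuple[str, str]]:
    """Pass through (kind, text) chunks until reasoning exceeds the budget.

    kind is "reasoning" or "answer". Tokens are approximated by whitespace
    splitting, which is a crude stand-in for a real tokenizer.
    """
    used = 0
    for kind, text in chunks:
        if kind == "reasoning":
            used += len(text.split())
            if used > budget:
                # Cut the trace and emit a control message instead; a real
                # system would presumably re-prompt the model at this point.
                yield ("control", nudge)
                return
        yield (kind, text)


stream = [
    ("reasoning", "consider option one"),
    ("reasoning", "now reconsider option one"),
    ("answer", "call the tool"),
]
capped = list(cap_reasoning(stream, budget=4))
print(capped)
```

The design question such a patch raises is where to set the budget: too low and the model loses useful planning, too high and the runaway traces reported above still get through.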

More user reports point the same way. Maxime Rivest's PDF conversion example logged 178 tool calls and 64,000 tokens for a single blog-to-PDF task, and bridgemindai's critique said GLM-5.2 often needs three or four prompts for work that GPT-5.5 or Opus 4.8 finishes in one.

That makes GLM-5.2 feel like two stories at once: frontier-adjacent open weights on quality, and a model that still spends too many tokens getting there.

Where it shows up

The rollout was quick enough to look like pent-up demand.

Within days, the model showed up across a wide spread of surfaces:

Hosted inference: Together AI added GLM-5.2 with 1M context and tool-heavy agent positioning, per Together AI's announcement.
Privacy-focused coding surface: Hyper added it with 1M context, zero data retention, and MIT licensing called out in the launch copy, per Hyper's post.
Browser agents: browser_use's BrowserCode demo claimed near Opus-level results on a task that cost $0.18.
Internal enterprise serving: Yuchenj_UW's reply said their team had already stood up a Databricks endpoint after the weights opened.
Local and compressed runs: Unsloth's guide said its 2-bit version shrank the model from 1.51 TB to 238 GB while retaining about 82% accuracy.
Consumer model platforms: Ollama's usage update said demand was high enough to double cloud GPU capacity.
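The size figures in the local-run item imply an effective bit width that can be checked by hand. This quick calculation assumes the 1.51 TB checkpoint stores 16-bit weights and that TB and GB are decimal units; neither is stated above.

```python
ORIGINAL_GB = 1510.0          # 1.51 TB, taken as decimal units (assumption)
QUANTIZED_GB = 238.0
ASSUMED_ORIGINAL_BITS = 16    # assumption: 16-bit source weights

compression_ratio = ORIGINAL_GB / QUANTIZED_GB
effective_bits = ASSUMED_ORIGINAL_BITS * QUANTIZED_GB / ORIGINAL_GB

print(f"{compression_ratio:.1f}x smaller, ~{effective_bits:.1f} bits per weight on average")
# 6.3x smaller, ~2.5 bits per weight on average
```

If the 16-bit assumption holds, an average above 2 bits would be consistent with some parts of the model being kept at higher precision, though the exact recipe is not described here.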

The open-weight part of the story is not abstract. People immediately started asking which harness, provider, and serving recipe to use, as Matt Pocock's question and Graham Neubig's serving question make obvious.

Vision gap

The missing capability is unusually clear because multiple otherwise bullish users hit the same wall.

GLM-5.2 does not support images, according to browser_use's direct reply, and Jeremy Howard's follow-up called that blindness the one big gap after otherwise praising the model's speed, judgment, and long-context handling.

That absence leaks into product usage immediately. Browser-based agent tasks can still look good when they reduce to text and tool use, but vision-heavy workflows remain a closed-model advantage for now. It is a narrow limitation compared with the rest of the launch, but it is also the easiest one to explain in a sentence: GLM-5.2 can now sit in serious coding and agent stacks, just not the ones that need to see.