AI Primer

GLM-5.2 ranks 30/99 on PrinzBench as testers report legal hallucinations

PrinzBench added GLM-5.2 and scored it 30/99 for legal research, while a separate LisanBench run placed GLM-5.2-high at #29 and noted high token use. The result matters because it cuts against code-centric GLM hype and points to weak search, statute fidelity, and reasoning on professional legal tasks.

TL;DR

  • deredleritt3r's PrinzBench post scored GLM-5.2 at 30/99 for legal research, below Gemini 3 Pro at 35/99 and far behind GPT-5.5 at 74/99.
  • According to deredleritt3r's search sub-score reply, GLM-5.2 also posted a weak 5/24 on the benchmark's search component, which matters because the test uses the same web chat tools enterprise users actually get.
  • scaling01's LisanBench post found GLM-5.2-high improved sharply over GLM-5, but still ranked only #29 and used about 32k output tokens versus roughly 19k for Kimi-K2.5 Thinking at a similar score.
  • Hangsiin's test report and scaling01's edit-distance failure note both point to a problem more basic than the headline scores: testers reported long waits, poor answers, and a dominant failure mode in which the model simply did not follow the task rules.

You can scan the PrinzBench chart for where GLM-5.2 landed relative to Gemini, Grok, and Opus, then compare it with the LisanBench leaderboard and a separate failure-pattern chart showing the model still inherits some of the hacky behavior other reasoning models have shown.

PrinzBench

On PrinzBench, GLM-5.2 landed in the crowded middle of the pack. deredleritt3r's chart placed it at 30/99, behind Gemini 3 Pro and Kimi-K2.5 Thinking at 35, and well behind Grok 4.20 and Opus 4.8 at about 42.

The sharper claim came from the writeup around the score. In the same post, deredleritt3r said GLM-5.2 produced inconsistent results and hallucinated statutory provisions that do not exist.

Web chat setup

The benchmark was not run in a stripped API harness. In a reply about methodology, deredleritt3r said the tests were done in the web chat, bundled search tools included, because that is how enterprise legal users actually use these systems.

That matters for interpreting the low score. According to deredleritt3r's follow-up, GLM-5.2 scored just 5/24 on the search sub-score, and in another reply he said he used Deep Think Max with Advanced Search on the provider's own surface.
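Because the overall score is out of 99 and the search sub-score is out of 24, the two are easier to compare as percentages. The snippet below is only a convenience sketch: the figures are the ones quoted in this article, while the dictionary layout and the `as_percent` helper are illustrative assumptions, not part of PrinzBench.

```python
# Scores as quoted in the article: (points earned, points available).
# The labels and layout are illustrative, not PrinzBench's own format.
quoted_scores = {
    "GPT-5.5 overall": (74, 99),
    "Gemini 3 Pro overall": (35, 99),
    "GLM-5.2 overall": (30, 99),
    "GLM-5.2 search sub-score": (5, 24),
}


def as_percent(earned, available):
    """Convert a raw score to a percentage, rounded to one decimal place."""
    return round(100 * earned / available, 1)


percentages = {label: as_percent(*pair) for label, pair in quoted_scores.items()}

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```

On these numbers, GLM-5.2's search sub-score works out to roughly 21%, below its overall score of roughly 30%.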

LisanBench

LisanBench tells a narrower story. scaling01's post said GLM-5.2-high ranked #29 with a score of 1834, up from GLM-5's 986.83, so the new model is better than its predecessor even if it still trails stronger open models.

The more interesting number was token use:

  • scaling01's comparison said Kimi-K2.5 Thinking scored about the same while averaging roughly 19k tokens.
  • The same post put GLM-5.2-high at roughly 32k tokens.
  • In a follow-up on reasoning efficiency, scaling01 summarized the tradeoff as Kimi over GLM over DeepSeek, with DeepSeek scoring higher mostly by spending more tokens.
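The token figures above can be turned into two rough ratios. This is back-of-the-envelope arithmetic on the approximate numbers quoted in this article, not a reproduction of scaling01's own analysis, and the variable names are illustrative.

```python
# Approximate figures as quoted in the article.
glm52_high_tokens = 32_000   # roughly 32k output tokens for GLM-5.2-high
kimi_tokens = 19_000         # roughly 19k for Kimi-K2.5 Thinking
glm52_high_score = 1834      # LisanBench score for GLM-5.2-high
glm5_score = 986.83          # LisanBench score for GLM-5

# How many times more tokens GLM-5.2-high used than Kimi-K2.5 Thinking.
token_ratio = glm52_high_tokens / kimi_tokens

# How much the LisanBench score grew from GLM-5 to GLM-5.2-high.
score_gain = glm52_high_score / glm5_score

print(f"Token use vs Kimi: {token_ratio:.2f}x")
print(f"Score vs GLM-5: {score_gain:.2f}x")
```

Read that way, GLM-5.2-high spent about 1.7 times as many tokens as Kimi-K2.5 Thinking for a similar score, while scoring about 1.9 times what GLM-5 did.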

So the two public reads were not actually contradictory. LisanBench suggested measurable progress over older GLM versions, while PrinzBench said that progress still does not cash out into dependable legal research.

Failure modes

LisanBench's breakdowns gave the strongest clue about why the model still underwhelmed. In one failure analysis, scaling01 said GLM-5.2 failed in 76.7% of runs because it did not respect the edit-distance-1 rule.
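The article does not spell out the rule itself. As a minimal sketch, assume "edit-distance-1" means each item in a chain must differ from the previous one by exactly one single-character insertion, deletion, or substitution. The function names and sample chains below are hypothetical and are not taken from LisanBench.

```python
def is_edit_distance_one(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insert, delete, or substitute."""
    if a == b:
        return False
    if abs(len(a) - len(b)) > 1:
        return False
    # Make `a` the shorter (or equal-length) string.
    if len(a) > len(b):
        a, b = b, a
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Substitution: everything after the mismatch must match.
        return a[i + 1:] == b[i + 1:]
    # Insertion into the shorter string: skip one character of the longer one.
    return a[i:] == b[i + 1:]


def first_violation(chain):
    """Return the index of the first step that breaks the rule, or None."""
    for idx in range(1, len(chain)):
        if not is_edit_distance_one(chain[idx - 1], chain[idx]):
            return idx
    return None


# Hypothetical chains for illustration only.
valid_chain = ["cat", "cot", "cog", "dog"]
broken_chain = ["cat", "cot", "dig"]
print(first_violation(valid_chain))
print(first_violation(broken_chain))
```

A checker of this kind flags a run as failed at the first step where two neighbouring items are more than one edit apart, which is the type of rule violation the failure analysis describes.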

A separate bridge-pattern chart said the model still showed the same kind of hacky insert-substitute-delete behavior seen in Opus 4.5, though at a lower rate than GLM-5.

That lines up with the legal benchmark complaints in deredleritt3r's original post: not just weaker final answers, but a model that drifts off the task in ways higher-scoring systems do less often.

Serving quirks

Some of the rough edges were operational, not only cognitive. In Hangsiin's report, Max mode sometimes produced no response even after dozens of minutes, and the tester said the website results were poor too.

There was also rollout instability. deredleritt3r's reply about Anthropic testing said one benchmarking run on another model was interrupted because the model got pulled halfway through. In his GLM settings reply, he argued that if a provider cannot serve its own model reliably on its own site, that is part of the product being evaluated.
