AI Primer

Wafer claims GLM-5.2 hits 222 tok/s and 12.6s end-to-end

Wafer said its GLM-5.2 deployment leads Artificial Analysis on throughput and latency, and priced usage at $1.20 input and $4.10 output per million tokens. Compare serverless and dedicated endpoints if you need speed at scale.

TL;DR

Wafer's GLM-5.2 page hosts the endpoint, and the Hermes Agent docs cover setup. The Artificial Analysis-style benchmark chart in Wafer's launch post can be compared against the later claim in Wafer's DeepSWE post. Jeremy Howard's early reaction in his GLM-5.2 post is also unusually strong for an open-weights release.

Artificial Analysis

Wafer's main claim is simple: fastest GLM-5.2 inference among listed providers on Artificial Analysis.

The attached chart in Wafer's launch post shows Wafer at 222 output tok/s, ahead of GMI at 173 and Together AI at 155. The same chart puts Wafer at 12.6 seconds end-to-end, ahead of Together AI at 16.9 and GMI at 17.2.
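The size of those gaps is easier to judge as percentages. The sketch below is a rough illustration using only the figures quoted from the chart; the dictionary and function names are hypothetical, and the numbers are Wafer's reported values, not independent measurements.

```python
# Figures as quoted from the chart in Wafer's launch post.
OUTPUT_TOK_PER_S = {"Wafer": 222, "GMI": 173, "Together AI": 155}
END_TO_END_S = {"Wafer": 12.6, "Together AI": 16.9, "GMI": 17.2}


def throughput_lead_pct(leader: str, other: str) -> float:
    """Percent by which the leader's output speed exceeds the other's."""
    return (OUTPUT_TOK_PER_S[leader] / OUTPUT_TOK_PER_S[other] - 1) * 100


def latency_reduction_pct(leader: str, other: str) -> float:
    """Percent by which the leader's end-to-end time is lower than the other's."""
    return (1 - END_TO_END_S[leader] / END_TO_END_S[other]) * 100


if __name__ == "__main__":
    for rival in ("GMI", "Together AI"):
        print(
            f"vs {rival}: "
            f"{throughput_lead_pct('Wafer', rival):.1f}% more tok/s, "
            f"{latency_reduction_pct('Wafer', rival):.1f}% less end-to-end time"
        )
```

On the quoted numbers, that works out to roughly 28% more output throughput than GMI and 43% more than Together AI, with end-to-end time about a quarter lower than either.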

A later post, Wafer's DeepSWE post, adds a second bragging right: GLM-5.2 as the top open-source model on DeepSWE at 44%.

Pricing

The pricing story is almost as aggressive as the speed story.

Across three replies and a follow-up post, the company repeated the same numbers: $1.20 input and $4.10 output per million tokens, plus, in two of those posts, a $0.20 cache price.
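To make the quoted rates concrete, here is a minimal sketch of a per-request cost estimate. It assumes the prices are per million tokens, as the summary at the top states, and that the $0.20 cache price applies to cached input tokens, which the posts do not spell out. The function and parameter names are hypothetical, not part of any Wafer API.

```python
# Prices in USD per million tokens, as repeated in Wafer's posts.
INPUT_PRICE = 1.20
OUTPUT_PRICE = 4.10
# Assumption: the "cache price" is the rate for cached input tokens.
CACHED_INPUT_PRICE = 0.20

TOKENS_PER_UNIT = 1_000_000


def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    cached_tokens is the portion of input_tokens billed at the cache rate.
    """
    if min(input_tokens, output_tokens, cached_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")

    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * INPUT_PRICE
        + cached_tokens * CACHED_INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    )
    return total / TOKENS_PER_UNIT


if __name__ == "__main__":
    # An 8,000-token prompt with a 2,000-token reply, nothing cached.
    print(f"${estimate_cost(8_000, 2_000):.4f}")
```

Under these assumptions, an 8,000-token prompt with a 2,000-token reply costs a little under two cents, and output tokens account for nearly half of that despite being a fifth of the total.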

The screenshot attached to Wafer's pricing reply places Wafer in the chart's "most attractive quadrant," pairing the lowest price point in the field with the highest output speed.

Serverless

The access model shifted fast enough to show up in replies before it showed up anywhere else.

According to Wafer's sunset notice, Wafer Pass was sunset because demand for serverless and dedicated endpoints was too high. But Wafer's endpoint reply says GLM-5.2 is serverless for now, which makes the near-term offering narrower than the dedicated-endpoint language in Wafer's sales reply suggests.

Wafer is also pointing developers at integration docs. In Wafer's Hermes Agent reply, the company linked setup instructions for Hermes Agent through Nous Research.

Demand

The most concrete signal in the reply thread is not the benchmark chart but the repeated capacity warning.

Wafer told users in four separate replies that demand was surging and more compute was being added. That lines up with Wafer's volume comment, which says the company had already seen enormous GLM-5.2 inference volume.

Outside Wafer's own account, Simon Willison's post framed the release as a race for ultra-fast inference providers, while Jeremy Howard's GLM-5.2 post and Vipul Ved's hands-on note both described the model itself as unusually strong.
