Skip to content
AI Primer
release

Vercel AI Gateway adds GLM-5.2 Fast at 150-250 tok/s

Vercel and Wafer launched a serverless GLM-5.2 endpoint on AI Gateway with 1M context and published pricing. Teams get a high-throughput open-model option inside an existing gateway instead of managing GLM inference directly.

3 min read
Vercel AI Gateway adds GLM-5.2 Fast at 150-250 tok/s
Vercel AI Gateway adds GLM-5.2 Fast at 150-250 tok/s

TL;DR

  • Vercel added a serverless zai/glm-5.2-fast endpoint to AI Gateway, and wafer_ai's launch post says Wafer is the serving partner.
  • The launch post from wafer_ai priced the endpoint at $3.00 per 1M input tokens, $10.25 per 1M output tokens, plus $0.50 cached input, with a 1M-token context window.
  • Throughput is the headline: wafer_ai's benchmark chart claimed 150 to 250 tok/s, while vercel_dev's announcement described it as 2x faster token throughput than other providers in internal benchmarks.
  • The rollout also expands AI Gateway's model mix, because jediahkatz's support note said this was the first GLM model the gateway team had supported.

You can jump straight to the AI Gateway model link, inspect wafer_ai's throughput chart, and see rauchg's reply confirm that Wafer is the inference partner behind this specific integration.

Model slug

The practical update is simple: GLM-5.2 Fast is now another hosted option inside Vercel's existing gateway, under the slug zai/glm-5.2-fast, according to vercel_dev's announcement and wafer_ai's launch post.

When asked who was behind the deployment, rauchg's reply said Vercel works with multiple inference partners and that Wafer is the partner in this case.

Throughput chart

The performance claim came with a concrete chart. wafer_ai's benchmark post advertised 150 to 250 tok/s, and the attached screenshot showed Wafer peaking at 245.5 TPS on AI Gateway traffic, ahead of Baseten at 121.0, Fireworks at 80.5, and Z.ai at 53.0.

That is narrower than a broad "fast model" claim, and more useful. The comparison is specifically about provider throughput for GLM traffic routed through AI Gateway, not a model-quality benchmark.

Price and context window

The commercial terms were in the launch tweet itself:

For model context, eliebakouch's CursorBench comment called GLM 5.2 roughly as cost efficient as Opus 4.8 on CursorBench, while natolambert's post argued that performance in that range puts pressure on frontier-model margins.

First GLM support

The small but notable platform detail came after launch: jediahkatz's support note said this was AI Gateway's first GLM integration and asked users to report odd behavior.

Wafer also used the thread to position the underlying model as more than a speed demo. In wafer_ai's DeepSWE reply, the company said GLM 5.2 ranks at or near the top on coding benchmarks and is the number one open-source model on DeepSWE.

Further reading

Discussion across the web

Where this story is being discussed, in original context.

On X· 2 threads
Price and context window2 posts
First GLM support1 post
Share on X