breakingJune 20, 2026

Ollama raises GLM-5.2 cloud capacity on NVIDIA B300s

Ollama said it doubled GPU capacity for GLM-5.2 cloud usage and said the model is currently hosted only in the US. The rollout adds capacity as open-model demand climbs, so users should check hosting and privacy details before deploying.

4 min read

Ollama raises GLM-5.2 cloud capacity on NVIDIA B300s

TL;DR

Ollama said GLM-5.2 demand spiked hard enough that it doubled cloud GPU capacity, with a follow-up reply saying the extra capacity had already been turned on.
According to Ollama's launch thread, GLM-5.2 ships with a 1M-token context window and is positioned for long-horizon coding and agentic tasks on Ollama's cloud.
Ollama's US hosting reply and a later clarification both said the GLM-5.2 cloud deployment is currently US-hosted, even though Ollama said elsewhere in the thread that it has some scaling capacity in Europe and very little in Singapore regional capacity reply.
Ollama tied the rollout to its privacy policy in multiple replies, saying the policy includes zero data retention and that user data is not used or sold privacy-policy reply.
Access is already wired into multiple surfaces, with the launch thread listing ollama run glm-5.2:cloud plus launch commands for Claude Code, Codex App, and Hermes Agent.

Ollama's own replies did most of the story here. The launch thread bundled the 1M-token context claim with ready-to-run commands, the capacity update pinned the cloud fleet to NVIDIA B300 Blackwell GPUs in the US, and the reply chain kept narrowing the operational detail, from US-only hosting for GLM-5.2 to a separate note that Ollama has some broader scaling capacity in Europe and very little in Singapore.

Capacity spike on B300s

The sequence was fast. Ollama's launch thread introduced GLM-5.2 on cloud, then a support reply described demand as "another record breaking one" before a second follow-up in the same thread said the company had doubled capacity.

A few hours later, Ollama's main capacity post specified what changed: more US-based cloud capacity on NVIDIA B300 Blackwell GPUs. That is a concrete infrastructure detail most launch threads skip.

US hosting, with broader regional capacity elsewhere

Ollama answered the hosting question several times, and the wording stayed narrow. One reply said GLM-5.2 is currently hosted only in the US, while another separated that fact from the broader privacy discussion.

The same reply chain added a second detail: Ollama's regional-capacity note said the company has some scaling capacity in Europe and very little in Singapore. Read together, that suggests GLM-5.2 cloud is US-only for now, even if Ollama's wider infrastructure footprint is not.

Privacy policy became part of the rollout

Instead of publishing a fresh infra note, Ollama kept pointing users to its privacy policy. In replies, the company and the company again said the policy states that user data is not used or sold, and the original launch thread paired that with a zero-data-retention claim.

That made privacy part of the operational story, not just brand copy, because the same thread was also answering where inference runs.

Commands and model surfaces

The launch thread broke access into four entry points:

ollama run glm-5.2:cloud for chat, per the launch thread
ollama launch claude --model glm-5.2:cloud, per the Claude Code command
ollama launch codex-app --model glm-5.2:cloud, per the Codex App command
ollama launch hermes --model glm-5.2:cloud, per the Hermes Agent command

An availability reply also said the model was usable immediately after launch.

Local runtimes and quantization are a separate story

Cloud capacity was the headline, but Ollama's replies also drew a line between hosted access and local packaging. One reply said smaller quantizations can make local runs feasible, though quality varies today.

Another reply contrasted Ollama with raw llama.cpp usage, saying its packaging work is meant to preserve repeated tool-call reliability, installation sanity, and model behavior rather than shipping oddly compressed or out-of-context quantizations. That is a useful clue about how Ollama wants GLM-5.2 consumed, especially for agent-style workloads.