AI Primer

NVIDIA releases Nemotron-Cascade 2 30B-A3 with IMO gold-level claims and Ollama support

NVIDIA published Nemotron-Cascade 2, a 30B MoE with 3B active parameters, claiming IMO gold-level math and Kimi K2.5-class code scores, then pushed it to Hugging Face and Ollama. It is worth testing if you want an open agent model with immediate local and hosted paths.


TL;DR

  • NVIDIA has released Nemotron-Cascade 2 as an open 30B MoE with 3B activated parameters, and both the launch summary and the HF release post point to public paper and model artifacts.
  • NVIDIA is claiming unusually strong reasoning and coding results: the paper card says the model reached “Gold Medal-level” IMO performance and is “on par with Kimi K2.5 on LiveCodeBench.”
  • The release moved quickly into usable distribution channels: the HF release post put it on Hugging Face, while Ollama’s announcement added a one-command local run path.
  • Early deployment work started immediately around the release, with the quantization post offering MLX and GGUF 5-bit variants and Ollama’s model page thread describing thinking, instruct, and agent-oriented usage.

What shipped and what NVIDIA is claiming

Nemotron-Cascade 2 is a new open model release centered on a 30B MoE architecture with 3B active parameters. The Hugging Face post links both the paper and the model collection, which makes this more than a benchmark teaser: there are public assets engineers can inspect and pull into existing workflows.

The headline claims are aggressive. NVIDIA’s paper card says the model achieves “Gold Medal-level performance” on the 2025 IMO and shows comparisons against DeepSeek-V3.5-35B-A3B and Kimi-K2.5-17-Thinking across LiveCodeBench, SWE Verified OpenHands, Humanity’s Last Exam, and ArenaHard v2. That same card describes the release as “Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation,” which is the main technical framing for how NVIDIA says it got there.

How you can run it today

The practical part of this launch is that it already has local runtime paths. Ollama’s announcement says you can run it with ollama run nemotron-cascade-2, and its model page positions the model for “reasoning and agentic capabilities” rather than as a generic chat checkpoint.
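Beyond the one-liner, Ollama also serves a local REST API (by default on http://localhost:11434), so the model can be called programmatically once it has been pulled. A minimal sketch of building a non-streaming request body for Ollama's /api/generate endpoint; the model tag nemotron-cascade-2 is taken from Ollama's announcement, and the prompt is a placeholder:

```python
import json

def build_generate_request(prompt: str, model: str = "nemotron-cascade-2") -> str:
    """Build the JSON body for a non-streaming Ollama /api/generate call."""
    body = {
        "model": model,   # model tag from Ollama's announcement
        "prompt": prompt,
        "stream": False,  # return a single JSON object instead of a token stream
    }
    return json.dumps(body)

# POST this payload to http://localhost:11434/api/generate with a
# running Ollama daemon to get a completion back.
payload = build_generate_request("Summarize the Cascade RL post-training approach.")
```

Setting "stream": False is the simplest way to test locally; the default streaming mode returns one JSON object per token, which is better suited to interactive use.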

Ollama’s follow-up model page thread adds a few deployment details that matter: the page describes thinking and instruct modes, mentions use in tools like OpenClaw, and highlights a 24GB variant with a 256K context window. Separately, the quantization post shows the community is already adapting the model for constrained hardware, with MLX 5-bit and GGUF Q5 variants on Hugging Face via the MLX build and the GGUF build. The GGUF summary says the quantized runtime footprint is about 26 GB, which puts local testing within reach on a single high-memory workstation rather than only server GPUs.
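The ~26 GB figure is plausible from first principles. A rough sketch of the arithmetic, assuming about 5.5 effective bits per weight (typical of 5-bit GGUF "K" quants, which mix precisions across tensors; the exact figure for this build is not stated in the source):

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB for a given quantization level."""
    return n_params * bits_per_weight / 8 / 1e9

# 30B total parameters at an assumed ~5.5 effective bits/weight gives
# roughly 20.6 GB of weights; the reported ~26 GB runtime footprint
# additionally covers the KV cache and runtime buffers.
weights_gb = quantized_weight_gb(30e9, 5.5)
```

Note that for an MoE model the full 30B parameters must fit in memory even though only 3B are active per token, which is why the footprint tracks total rather than active parameter count.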
