NVIDIA published Nemotron-Cascade 2, a 30B MoE with 3B active parameters, claiming IMO gold-level math and Kimi K2.5-class code scores, then pushed it to Hugging Face and Ollama. It is worth testing if you want an open agent model with immediate local and hosted paths.

Nemotron-Cascade 2 is a new open model release centered on a 30B MoE architecture with 3B active parameters. The Hugging Face post links both the paper and the model collection, which makes this more than a benchmark teaser: there are public assets engineers can inspect and pull into existing workflows.
The headline claims are aggressive. NVIDIA’s paper card says the model achieves “Gold Medal-level performance” on the 2025 IMO and shows comparisons against DeepSeek-V3.5-35B-A3B and Kimi-K2.5-17-Thinking across LiveCodeBench, SWE-Bench Verified (OpenHands), Humanity’s Last Exam, and Arena-Hard v2. That same card describes the release as “Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation,” which is the main technical framing for how NVIDIA says it got there.
The practical part of this launch is that it already has local runtime paths. Ollama’s announcement says you can run it with ollama run nemotron-cascade-2, and its model page positions the model for “reasoning and agentic capabilities” rather than as a generic chat checkpoint.
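Beyond the CLI, a local Ollama instance also exposes the model over its documented HTTP API, which is how you would wire it into an agent loop or test harness. A minimal sketch, assuming Ollama is running on its default port (11434) and the model tag `nemotron-cascade-2` from the announcement has been pulled; only the payload construction is shown here, the `curl` line is the actual call:

```python
import json

def build_generate_payload(prompt: str, model: str = "nemotron-cascade-2") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    # "stream": False asks Ollama to return one complete JSON object
    # instead of a stream of partial responses.
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_generate_payload("Write a binary search in Python.")
print(json.dumps(payload))

# To actually send it against a running Ollama instance:
#   curl http://localhost:11434/api/generate -d '{"model": "nemotron-cascade-2", ...}'
```

The same payload shape works for any Ollama-served model, so swapping checkpoints in an existing harness is a one-string change.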
A follow-up thread on Ollama’s model page adds a few deployment details that matter: it describes thinking and instruct modes, mentions use in tools like OpenClaw, and highlights a 24GB variant with a 256K context window. Separately, the quantization post shows the community is already adapting the model for constrained hardware, with MLX 5-bit and GGUF Q5 variants on Hugging Face via the MLX build and the GGUF build. The GGUF summary says the quantized runtime footprint is about 26 GB, which puts local testing within reach on a single high-memory workstation rather than only server GPUs.
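The ~26 GB figure is plausible on a back-of-envelope check. Note that a MoE stores all experts even though only 3B parameters are active per token, so the full 30B weights count toward the footprint. A rough sketch, assuming Q5_K_M averages about 5.5 bits per weight (an approximation, not from the release notes):

```python
# Back-of-envelope check of the reported ~26 GB quantized runtime footprint.
params = 30e9            # total parameters: all experts are stored, not just the 3B active
bits_per_weight = 5.5    # assumed average for Q5_K_M quantization
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"{weights_gb:.1f} GB of weights")  # ≈ 20.6 GB before runtime overhead
```

Weights alone come to roughly 20.6 GB; KV cache, activation buffers, and runtime overhead plausibly account for the remaining few GB of the reported 26 GB.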
Nvidia just released Nemotron-Cascade 2 on Hugging Face paper: huggingface.co/papers/2603.19… model: huggingface.co/collections/nv…
Nvidia released Nemotron-Cascade 2. A 30B-A3B MoE open model on par with Kimi K2.5 on LiveCodeBench. It achieved IMO gold level!
🚀 Introducing Nemotron-Cascade 2 🚀 Just 3 months after Nemotron-Cascade 1, we’re releasing Nemotron-Cascade 2: an open 30B MoE with 3B active parameters, delivering best-in-class reasoning and strong agentic capabilities. 🥇 Gold Medal-level performance on IMO 2025, IOI
Nemotron-Cascade-2 is now available to run with Ollama. ollama run nemotron-cascade-2 To run it locally with OpenClaw: ollama launch openclaw --model nemotron-cascade-2 This model from NVIDIA delivers strong reasoning and agentic capabilities on par with models with up to 20x …
Couldn't find any quants so I made some: MLX 5-bit: huggingface.co/AdrienBrault/N… GGUF Q5_K_M: huggingface.co/AdrienBrault/N… GGUF Q5_1: huggingface.co/AdrienBrault/N…