releaseJune 4, 2026

NVIDIA releases Nemotron 3 Ultra: 550B MoE, 1M context

NVIDIA shipped Nemotron 3 Ultra, a 550B/55B-active hybrid Mamba-Transformer MoE with open weights, data, and recipe, plus broad runtime and host support. It matters because the model pairs frontier open benchmarks with immediate agent-serving options, though local use still needs heavy quantization or large-memory hardware.

6 min read

NVIDIA releases Nemotron 3 Ultra: 550B MoE, 1M context

TL;DR

NVIDIA shipped Nemotron 3 Ultra as a fully open 550B MoE model with 55B active parameters, a 1M token context window, and open weights, data, and recipe, according to ctnzr's launch post and the official NVIDIA blog.
The main design bet is throughput for long agent runs: ctnzr's architecture note says the model uses very little full attention, while the technical report claims up to 5.9x, 4.8x, and 1.6x higher throughput than GLM-5.1-754B-A40B, Kimi-K2.6-1T-A32B, and Qwen-3.5-397B-17B in its long-output setup.
Third-party evals put it at the front of the US open-weight pack, with ArtificialAnlys scoring it at 47.7 on the Artificial Analysis Intelligence Index and ValsAI placing it fifth overall on the Vals Index at 43.9%.
Day-zero support landed fast across serving stacks and agent products: lmsysorg announced SGLang and Miles support, while vllm_project and vercel_dev posted same-day availability on vLLM and Vercel AI Gateway.
Running it yourself is still a big-box affair. The Hugging Face model card lists a minimum of 8x GB200, B200, GB300, or B300, 16x H100, or 8x H200, while danielhanchen said Unsloth's dynamic 1-bit GGUF still weighs about 190 GB on disk.

You can read NVIDIA's launch post, skim the full 65-page technical report, and browse the Hugging Face model card. The rollout was immediate across vLLM, SageMaker JumpStart, and Vercel AI Gateway. One odd but useful detail came from MaximeRivest's template inspection, which found an XML-style tool format instead of the JSON schemas most tool-calling stacks expect.

What shipped

Nemotron 3 Ultra is NVIDIA's new top-end Nemotron model: 550B total parameters, 55B active, built for long-running agents, and released under OpenMDW 1.1 on Hugging Face through both BF16 and NVFP4 variants.

The model's public positioning is not subtle. NVIDIA and partners framed it around frontier reasoning plus speed, with ArtificialAnlys calling it the strongest US open-weight model on its intelligence index, while ArtificialAnlys' Computex note says Jensen Huang used Artificial Analysis charts in the Computex keynote to present Nemotron 3 Ultra's performance.

Third-party scores are strong but not universal wins. ArtificialAnlys put Nemotron 3 Ultra at 47.7 on its Intelligence Index, ahead of Gemma 4 31B at 39.2 and gpt-oss-120b at 33.3, while its own write-up notes Gemma 4 31B still leads Nemotron on the site's Coding Index by about one point in Terminal-Bench Hard plus SciCode, per Artificial Analysis.

Why it is faster

The core architectural move is to replace most attention-heavy layers with Mamba layers, then keep selective attention where exact recall matters. NVIDIA's launch post pairs that with LatentMoE, NVFP4, and multi-token prediction.

That design shows up in the throughput claims. The technical report says Nemotron 3 Ultra reaches 5.9x the throughput of GLM-5.1-754B-A40B, 4.8x the throughput of Kimi-K2.6-1T-A32B, and 1.6x the throughput of Qwen-3.5-397B-17B in an 8K-input, 64K-output benchmark. ArtificialAnlys' Terminal-Bench latency study also found it on the Pareto frontier for performance versus time per task under turn-budget limits.

NVIDIA's speed story is really an agent story. As baseten puts it, the claim is that step 300 can run as fast as step 3 because state stays fixed-size instead of dragging a quadratic attention bill behind every turn.

Post-training and openness

NVIDIA published more than weights. kimmonismus highlighted the unusual combination of open weights, open training data, and the full recipe, while yacineMTB zeroed in on the same thing with a simpler verdict: open weights and open data.

The training stack has a few unusually concrete details:

Pre-training ran in NVFP4 across 20T tokens, according to ctnzr's architecture note and the technical report.
Context was later extended to 1M tokens, per the technical report.
Post-training used SFT, RL, and Multi-teacher On-Policy Distillation, or MOPD, according to ctnzr's retweeted training-stack summary and the official blog.
MOPD used more than ten specialist teacher models, as cedric_chee noted from the report.
NVIDIA also disclosed training instability. nrehiew_ called out the report's note that the run was cut to 20T tokens after divergences, a detail ctnzr's architecture note had already flagged.

That level of recipe disclosure is half the launch. The interesting part is not just that the model is open, but that NVIDIA published enough of the stack for others to copy the training playbook rather than only consume the checkpoint.

Where it shows up

The day-one distribution list was huge:

Serving runtimes: lmsysorg for SGLang, vllm_project for vLLM, and Modal's example for low-latency SGLang serving on Modal.
Cloud and gateway surfaces: wandb for CoreWeave serverless inference, vercel_dev for AI Gateway, and SageMaker JumpStart for one-click AWS deployment.
Agent frameworks and products: LangChain for Deep Agents, OpenHandsDev for tier-1 OpenHands support, and ollama for direct launch commands into Claude Code, Hermes Agent, and OpenClaw.
End-user hosts: opencode, cline, AskVenice, and kilocode's free-in-Kilo post all offered same-day access.

That rollout breadth matters as much as the model card. Nemotron 3 Ultra did not arrive as an isolated checkpoint. It arrived pre-wired into the agent tooling stack people actually use.

Tool calling and agent shape

The Hugging Face card says reasoning is configurable through an enable_thinking chat-template switch, rather than a separate reasoning SKU, on the model card. That is a small implementation detail, but it changes how wrappers expose the model.

A second detail came from MaximeRivest's template inspection, which found an XML-style system prompt and tool format, with tool results fed back as user messages inside XML tags rather than through a distinct tool role. The vLLM launch post also emphasizes OpenAI-compatible serving, so adapters now have to bridge a very non-OpenAI native prompt format into OpenAI-shaped APIs.

Hardware floor and local workarounds

The official hardware floor is high. The Hugging Face model card lists minimum inference hardware as 8x GB200, B200, GB300, or B300, 16x H100, or 8x H200. ValsAI also notes the model is text-only, with a 1M context window and 128K output tokens.

That helps explain the split rollout. Self-hosting exists, but much of the first-day activity centered on hosted surfaces, free trials, and managed endpoints. NousResearch offered two free weeks in Nous Portal, the Hermes Agent setup guide link narrowed that free window to June 4 through June 18 for the :free tag, and danielhanchen said even a dynamic 1-bit Unsloth GGUF still lands around 190 GB on disk.

The result is a model that is open in the reproducibility sense before it is easy in the laptop sense.