AI Primer
release

Nemotron 3 Nano Omni launches 30B-A3B multimodal model with 256K context

NVIDIA opened Nemotron 3 Nano Omni, a 30B-A3B model for text, image, audio, and video, with day-one serving support. That lets teams run one open model for perception-heavy agents instead of stitching separate components.


TL;DR

You can jump from the NVIDIA tech report to the SGLang cookbook, check the vLLM serving example, try the hosted model on OpenRouter or Together AI, and pull it locally from Ollama or as GGUF weights via Unsloth.
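Since both OpenRouter and Together AI expose the model behind OpenAI-compatible endpoints, the hosted path is a standard `/chat/completions` request. A minimal sketch of the request payload, assuming a hypothetical model slug (check the provider catalogs for the exact identifier):

```python
import json

# Assumed slug for illustration -- verify against the OpenRouter /
# Together AI model catalogs before using.
MODEL_SLUG = "nvidia/nemotron-3-nano-omni"

def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible /chat/completions payload."""
    return {
        "model": MODEL_SLUG,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Summarize this launch in one sentence.")
print(json.dumps(payload, indent=2))
```

The same payload shape works against any of the hosted endpoints; only the base URL and API key differ.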

Unified multimodal loop

The launch is aimed at teams that do not want separate speech, vision, and language models glued together with routing logic. In baseten's description, Nemotron 3 Nano Omni uses one unified context window across audio, images, text, and video, and the same post says NVIDIA is positioning it for subagents handling computer use, document intelligence, and video or audio reasoning.

That architectural pitch shows up almost verbatim across the rollout. ctnzr's launch post called out the Nemotron Hybrid SSM MoE architecture, while fal's announcement summarized the product as a single model for multimodal agents with text, image, video, and audio in one loop.
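The "one loop" claim cashes out at the API level as a single message carrying several modalities at once. A sketch using the OpenAI-style content-parts schema that many serving stacks accept; whether a given Nemotron endpoint supports all three part types in one turn is an assumption to verify against its docs:

```python
import base64

def multimodal_message(text: str, image_url: str, audio_bytes: bytes) -> dict:
    """One user turn mixing text, an image, and audio in a single message,
    rather than routing each modality to a separate model."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
            {
                "type": "input_audio",
                "input_audio": {
                    "data": base64.b64encode(audio_bytes).decode(),
                    "format": "wav",
                },
            },
        ],
    }

msg = multimodal_message(
    "What is said and shown here?",
    "https://example.com/frame.png",  # placeholder URL
    b"\x00\x01",  # placeholder audio bytes
)
```

The unified context window is what makes this a single request instead of three calls stitched together by routing logic.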

30B total, 3B active, 256K context

Across the launch posts, the spec sheet stayed consistent:

- 30B total parameters, roughly 3B active per token (the "30B-A3B" MoE configuration)
- 256K context window, unified across modalities
- Input modalities: text, image, audio, and video
- Nemotron Hybrid SSM MoE architecture

The benchmark framing was also consistent, though the figures in the public launch graphics are vendor-measured. OpenRouter's launch slide compared Nemotron 3 Nano Omni against Qwen3-Omni across MMLongBench-Doc, DailyOmni, VoiceBench, OCRBenchV2, MediaPerf, and WorldSense, and lmsysorg's post added the headline numbers: up to 7.4x throughput on multi-doc workloads, 9.2x on video, and about 20% higher multimodal intelligence than the leading open alternative.

Serving stack showed up on day zero

The quickest signal in this launch was how many inference surfaces were ready immediately: SGLang shipped a cookbook, vLLM published a serving example, OpenRouter and Together AI put the model behind hosted endpoints, and Ollama and Unsloth covered the local path.

That combination makes the launch look less like an isolated model drop and more like a pre-wired ecosystem release. The interesting detail is that the integration posts did not just say "supported"; they exposed parser choices, quantization hooks, and media-sampling knobs that show where the real deployment complexity still lives.

Local paths got attention fast

The most immediately practical follow-on was local packaging. UnslothAI's post said the model can run on roughly 25 GB of RAM, with 8-bit needing 36 GB, and linked both a GGUF release and an Unsloth guide.
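Those memory figures line up with simple back-of-envelope arithmetic: weight memory is roughly total parameters times bytes per parameter, plus runtime overhead for KV cache and activations. A rough estimator, where the flat 4 GB overhead is an illustrative assumption rather than a measured figure:

```python
def quantized_weight_gb(total_params_b: float, bits: int, overhead_gb: float = 4.0) -> float:
    """Rough memory estimate for a quantized model: 1e9 params at
    (bits / 8) bytes each is ~that many GB of weights, plus a flat
    overhead allowance (assumed, not measured)."""
    weights_gb = total_params_b * bits / 8
    return weights_gb + overhead_gb

# 30B params: 4-bit weights are ~15 GB, landing in the ballpark of the
# quoted ~25 GB once real-world overhead is included; 8-bit weights are
# ~30 GB, close to the quoted 36 GB.
print(quantized_weight_gb(30, 4))
print(quantized_weight_gb(30, 8))
```

The gap between the estimate and the quoted numbers is the context window: a 256K KV cache costs far more than a token-trivial one, so the flat overhead term is the weakest part of this sketch.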

Ollama also moved on launch day. According to ollama's local release note, Nemotron 3 Nano Omni runs locally through Ollama but requires the newer 0.22 release, which is the clearest compatibility caveat surfaced in the launch coverage.
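That version floor is easy to gate on in tooling. A stdlib-only sketch comparing dotted version strings (a real deployment would use `packaging.version` instead):

```python
def meets_min_version(installed: str, required: str = "0.22") -> bool:
    """True if the installed version satisfies the floor. Per the release
    note, Nemotron 3 Nano Omni needs Ollama 0.22 or newer."""
    def to_tuple(v: str) -> tuple[int, ...]:
        return tuple(int(part) for part in v.split("."))
    return to_tuple(installed) >= to_tuple(required)

print(meets_min_version("0.21.5"))  # older build: fails the floor
print(meets_min_version("0.22.1"))  # satisfies the caveat
```

Naive tuple comparison breaks on pre-release suffixes like `0.22-rc1`, which is why `packaging.version` is the safer choice outside a sketch.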

