NVIDIA launches Nemotron 3 Nano Omni: a 30B-A3B multimodal model with 256K context
NVIDIA released Nemotron 3 Nano Omni, an open 30B-A3B model for text, image, audio, and video, with day-one serving support. That lets teams run one open model for perception-heavy agents instead of stitching together separate components.

TL;DR
- NVIDIA shipped Nemotron 3 Nano Omni as an open 30B multimodal model with roughly 3B active parameters, and ctnzr's launch post and lmsysorg's deployment thread both describe a single model that takes audio, video, image, and text and returns text.
- The core pitch is stack simplification: according to baseten's launch post, the model folds audio and vision encoders into one architecture, while fal's launch note frames it as one reasoning loop for multimodal agents.
- Day-one support landed fast: vllm_project's support post, lmsysorg's SGLang post, OpenRouter's availability post, togethercompute's launch post, and ollama's local release note all announced serving or access on launch day.
- NVIDIA and partners emphasized efficiency as much as accuracy, with lmsysorg's benchmark summary claiming up to 7.4x throughput on multi-document workloads and 9.2x on video, while UnslothAI's local-run post said the model can run on about 25 GB of RAM.
You can jump from the NVIDIA tech report to the SGLang cookbook, check the vLLM serving example, try the hosted model on OpenRouter or Together AI, and pull it locally from Ollama or as GGUF weights via Unsloth.
Unified multimodal loop
The launch is aimed at teams that do not want separate speech, vision, and language models glued together with routing logic. In baseten's description, Nemotron 3 Nano Omni uses one unified context window across audio, images, text, and video, and the same post says NVIDIA is positioning it for subagents handling computer use, document intelligence, and video or audio reasoning.
That architectural pitch shows up almost verbatim across the rollout. ctnzr's launch post called out the Nemotron Hybrid SSM MoE architecture, while fal's announcement summarized the product as a single model for multimodal agents with text, image, video, and audio in one loop.
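To make the one-loop pitch concrete, here is a minimal sketch of what a single mixed-modality request could look like against an OpenAI-compatible endpoint such as vLLM's. The model ID and media URLs are placeholders, and the audio_url content part is a vLLM extension to the OpenAI chat schema rather than anything confirmed in the launch posts.

```bash
# Minimal sketch: text, an image, and an audio clip in one request to a
# locally served OpenAI-compatible endpoint. The model ID and URLs are
# placeholders; "audio_url" is a vLLM extension, so check your server's
# docs before relying on it.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-nano-omni",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Summarize what is shown and said."},
        {"type": "image_url", "image_url": {"url": "https://example.com/slide.png"}},
        {"type": "audio_url", "audio_url": {"url": "https://example.com/clip.wav"}}
      ]
    }]
  }'
```

The point of the unified context window is that nothing above requires routing logic: one endpoint, one message, all modalities.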
30B total, 3B active, 256K context
Across the launch posts, the stable spec sheet looked like this:
- 30B total parameters, per lmsysorg's deployment thread and vllm_project's support post
- About 3B active parameters per forward pass, per lmsysorg's deployment thread
- 256K context window, per lmsysorg's deployment thread, vllm_project's support post, and OpenRouter's availability post
- Hybrid Transformer-Mamba MoE or Hybrid SSM MoE architecture, per vllm_project's wording and ctnzr's launch post
- FP8 and NVFP4 support, per lmsysorg's deployment thread and vllm_project's support post
- Open weights, per vllm_project's support post
The benchmark framing was also consistent, though still vendor-measured in the public launch graphics. OpenRouter's launch slide compared Nemotron 3 Nano Omni against Qwen3-Omni across MMLongBench-Doc, DailyOmni, VoiceBench, OCRBenchV2, MediaPerf, and WorldSense, and lmsysorg's post added the headline numbers: up to 7.4x throughput on multi-document workloads, 9.2x on video, and about 20% higher multimodal intelligence than the leading open alternative.
Serving stack showed up on day zero
The quickest signal in this launch was how many inference surfaces were ready immediately.
- vllm_project's support post said vLLM had day-zero support with tool calling, reasoning support, and efficient video sampling for long-video workloads.
- The same vllm_project screenshot showed a concrete `vllm serve` example with `--enable-auto-tool-choice`, `--tool-call-parser qwen3_coder`, `--reasoning-parser nemotron_v3`, and media IO settings for 512 video frames at 1 fps; a reconstruction appears after this list.
- lmsysorg's deployment thread announced SGLang support on launch day, and the cookbook is linked directly from lmsysorg's cookbook post.
- OpenRouter's availability post put up a hosted route immediately, including a free model page.
- togethercompute's launch post and fal's launch post both pushed managed inference availability the same day.
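Putting the screenshot's flags together, a serve invocation would look roughly like the sketch below. Only the tool-choice and parser flags are confirmed by vllm_project's post; the model ID and the exact media IO flag syntax are assumptions to verify against the vLLM docs for your release.

```bash
# Reconstruction from vllm_project's screenshot. Confirmed: the
# tool-choice and parser flags. Assumed: the model ID and the exact
# media IO flag encoding the 512-frame / 1 fps video sampling.
vllm serve nvidia/nemotron-3-nano-omni \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_v3 \
  --media-io-kwargs '{"video": {"num_frames": 512, "fps": 1}}'
```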
That combination makes the launch look less like an isolated model drop and more like a pre-wired ecosystem release. The interesting detail is that the integration posts did not just say "supported"; they exposed parser choices, quantization hooks, and media sampling knobs that show where the real deployment complexity still lives.
Local paths got attention fast
The most immediately practical follow-on was local packaging. UnslothAI's post said the model can run on roughly 25 GB of RAM, with 8-bit needing 36 GB, and linked both a GGUF release and an Unsloth guide.
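For the GGUF route, the download step would look something like the sketch below; the repo ID and quantization pattern are placeholders, since the actual release is linked from UnslothAI's post. A roughly 4-bit quant is also what makes the ~25 GB RAM figure plausible for a 30B-total model.

```bash
# Placeholder repo ID and quant pattern: substitute the real ones from
# the GGUF release linked in UnslothAI's post.
huggingface-cli download unsloth/Nemotron-3-Nano-Omni-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./nemotron-3-nano-omni
```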
Ollama also moved on launch day. According to ollama's local release note, Nemotron 3 Nano Omni is available locally through Ollama, but it requires Ollama 0.22 or newer, the clearest compatibility caveat surfaced in the evidence set.
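The Ollama path is the shortest, with the version requirement as the one gotcha. The model tag below is a guess at the published name, so check the Ollama library for the real listing.

```bash
# Requires Ollama 0.22 or newer, per ollama's release note.
ollama --version                 # confirm 0.22+
ollama run nemotron-3-nano-omni  # hypothetical tag; verify in the library
```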