AI Primer

NVIDIA releases SANA-Streaming for real-time 1-minute video edits

NVIDIA released SANA-Streaming as a system for real-time editing of minute-long video, including scene swaps, style transfer, inpainting, and robotics overlays. The demo emphasized temporal consistency across longer clips rather than one-off image transformations.

TL;DR

You can browse the official project page, read the paper, and jump to the broader NVlabs/Sana repo that minchoi's post links to. The demos are a fun spread: the driving clip turns rain into dawn snow, the edit demo removes a person across a full shot, and the robotics example swaps human limbs for robot hardware without changing the interaction.

Demo set

The strongest part of this release is how varied the edit prompts are. NVIDIA is not just showing filters. It is showing scene changes, object replacement, style transfer, inpainting, and body-part remapping on moving footage, and the demos in the tweet thread span those same categories.

Real-time numbers

NVIDIA's project page makes three concrete claims: one-minute editing, 1280 x 704 output, and 24 FPS end-to-end on a single RTX 5090. The paper adds that the diffusion core itself runs at 58 FPS after fused kernels and mixed-precision quantization tuned for Blackwell.

That is the interesting part for creators. The pitch is not offline cleanup after a render. The pitch is footage that keeps playing while the edit model keeps up.
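The throughput claims above are easier to read as per-frame time budgets. The following is a back-of-the-envelope sketch using only the figures quoted from the project page and paper; the constant and function names are illustrative, and treating the two rates as simple per-frame times is an assumption, since the real pipeline may overlap stages.

```python
# Figures quoted in the article; everything else here is plain arithmetic.
CLIP_SECONDS = 60          # "one-minute editing"
WIDTH, HEIGHT = 1280, 704  # output resolution
END_TO_END_FPS = 24        # full pipeline on a single RTX 5090
CORE_FPS = 58              # diffusion core alone, per the paper


def frame_budget_ms(fps: float) -> float:
    """Average time available per frame, in milliseconds, at a given rate."""
    return 1000.0 / fps


total_frames = CLIP_SECONDS * END_TO_END_FPS           # frames in a one-minute clip
end_to_end_ms = frame_budget_ms(END_TO_END_FPS)        # about 41.7 ms per frame
core_ms = frame_budget_ms(CORE_FPS)                    # about 17.2 ms per frame
pixels_per_second = WIDTH * HEIGHT * END_TO_END_FPS    # output pixel throughput

if __name__ == "__main__":
    print(f"frames per clip:      {total_frames}")
    print(f"end-to-end budget:    {end_to_end_ms:.1f} ms/frame")
    print(f"diffusion core alone: {core_ms:.1f} ms/frame")
    print(f"output pixels/second: {pixels_per_second:,}")
```

At 24 FPS a one-minute clip is 1,440 frames, and the pipeline has roughly 41.7 ms per frame on average to keep pace with playback.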

Causal pipeline

According to the project page, SANA-Streaming uses a Hybrid Diffusion Transformer that mixes Gated DeltaNet blocks for compact global memory with softmax-attention blocks for local source alignment. The paper says Cycle-Reverse Regularization trains the model to reconstruct source frames from edited output, which is how NVIDIA tries to keep long clips from drifting.

The same page says a causal VAE decoder handles streaming generation and decoding. That matters because the release is explicitly about preserving source motion and non-edited content, not generating a brand-new clip from scratch.
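To make the "compact global memory" idea concrete, here is a minimal pure-Python sketch of a gated delta-rule update, the recurrence that Gated DeltaNet-style layers are generally described as using. It is not NVIDIA's implementation: the function names, the scalar gates `alpha` (decay) and `beta` (write strength), and the tiny dimensions are assumptions for illustration, and production kernels use chunked, parallel formulations rather than a per-step loop.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # shape: (value_dim, key_dim)


def matvec(S: Matrix, x: Vector) -> Vector:
    """Multiply matrix S by vector x."""
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]


def gated_delta_step(
    S: Matrix, q: Vector, k: Vector, v: Vector, alpha: float, beta: float
) -> Tuple[Matrix, Vector]:
    """One causal step: S <- alpha * S (I - beta k k^T) + beta v k^T, out = S q."""
    norm = sum(x * x for x in k) ** 0.5 or 1.0
    k = [x / norm for x in k]          # unit-norm key
    old = matvec(S, k)                 # value currently stored under this key
    S_new = [
        [
            # decay the memory, erase the old association, write the new one
            alpha * (S[i][j] - beta * old[i] * k[j]) + beta * v[i] * k[j]
            for j in range(len(k))
        ]
        for i in range(len(v))
    ]
    return S_new, matvec(S_new, q)


# Hypothetical usage: the state stays 2x2 no matter how many steps are processed.
state: Matrix = [[0.0, 0.0], [0.0, 0.0]]
state, first = gated_delta_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0], alpha=1.0, beta=1.0)
state, second = gated_delta_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[5.0, 6.0], alpha=1.0, beta=1.0)
```

The point of the sketch is that the state has a fixed size regardless of how many steps have been seen, and writing to an existing key replaces the old value instead of accumulating. That is the property that makes this kind of layer attractive for minute-long streams, where a growing attention cache would be costly.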

Stacked edits

One of the more useful demos is minchoi's one-source-multiple-edits clip, which applies two different edits to the same source video: a background swap to a speakeasy lounge, then a wardrobe change to a burgundy velvet smoking jacket.

That same thread also ends with minchoi's robotics demo, in which hands and forearms become articulated robot parts while the tool use stays intact. Between the stacked-edit clip and the robotics pass, NVIDIA is quietly showing that the system is built for selective rewrites inside an existing performance, not just whole-frame restyling.
