NVIDIA releases SANA-Streaming for real-time 1-minute video edits
NVIDIA released SANA-Streaming as a system for real-time editing of minute-long video, including scene swaps, style transfer, inpainting, and robotics overlays. The demo emphasized temporal consistency across longer clips rather than one-off image transformations.

TL;DR
- NVIDIA's SANA-Streaming project page says the system does causal video-to-video editing over one-minute streams, while minchoi's launch thread shows the headline claim in action.
- The official page claims 24 end-to-end FPS at 1280 x 704 on a single RTX 5090, and the accompanying arXiv paper says the DiT core reaches 58 FPS after system co-design.
- minchoi's autonomous-driving demo and minchoi's inpainting demo frame the release less as text-to-video and more as live footage rewrite: swap weather, remove a person, keep motion intact.
- NVIDIA attributes the temporal stability to a Hybrid Diffusion Transformer plus Cycle-Reverse Regularization on the project page, while minchoi's style-transfer example shows the kind of long-range consistency that claim is aiming at.
You can browse the official project page, read the paper, and jump to the broader NVlabs/Sana repo that minchoi's repo link pointed people toward. The demos are a fun spread: the driving clip turns rain into dawn snow, the edit demo removes a person across a full shot, and the robotics example swaps human limbs for robot hardware without changing the interaction.
Demo set
The strongest part of this release is how varied the edit prompts are. NVIDIA is not just showing filters. It is showing scene changes, object replacement, style transfer, inpainting, and body-part remapping on moving footage.
Across the tweet thread, the demo categories break down like this:
- Autonomous-driving weather and lighting shift, rain to light snowfall at dawn (minchoi's autonomous-driving demo)
- Person removal with temporally consistent background inpainting (minchoi's inpainting demo)
- Chinese ink wash style transfer over the full clip (minchoi's style-transfer example)
- Local object replacement, green car to metallic red (minchoi's local-edit demo)
- Full scene mood shift to a dawn aesthetic (minchoi's dawn-scene demo)
- Robotics overlay that preserves object interaction timing (minchoi's robotics demo)
Real-time numbers
NVIDIA's project page makes three concrete claims: one-minute editing, 1280 x 704 output, and 24 FPS end-to-end on a single RTX 5090. The paper adds that the diffusion core itself runs at 58 FPS after fused kernels and mixed-precision quantization tuned for Blackwell.
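To put those figures in per-frame terms, here is a small back-of-the-envelope calculation. It uses only the numbers quoted above (24 FPS end to end, 58 FPS for the DiT core, 1280 x 704, one minute); the helper names are made up for illustration, and the gap between the two budgets is plain arithmetic, not a measured breakdown of where NVIDIA's pipeline spends its time.

```python
# Figures quoted from the project page and paper.
END_TO_END_FPS = 24
DIT_CORE_FPS = 58
WIDTH, HEIGHT = 1280, 704
CLIP_SECONDS = 60


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def frames_in_clip(seconds: float, fps: float) -> int:
    """Number of frames in a clip of the given length."""
    return round(seconds * fps)


end_to_end_ms = frame_budget_ms(END_TO_END_FPS)  # about 41.7 ms per frame
dit_core_ms = frame_budget_ms(DIT_CORE_FPS)      # about 17.2 ms per frame
# Difference between the two budgets. How that time is actually split across
# encoding, decoding and I/O is not stated in the source.
gap_ms = end_to_end_ms - dit_core_ms

total_frames = frames_in_clip(CLIP_SECONDS, END_TO_END_FPS)
pixels_per_second = WIDTH * HEIGHT * END_TO_END_FPS

print(f"end-to-end budget: {end_to_end_ms:.1f} ms/frame")
print(f"DiT core budget:   {dit_core_ms:.1f} ms/frame")
print(f"gap:               {gap_ms:.1f} ms/frame")
print(f"frames in clip:    {total_frames}")
print(f"pixels per second: {pixels_per_second:,}")
```

In other words, a one-minute clip at the claimed rate is 1,440 frames, each of which has to be produced in roughly 42 milliseconds for playback to keep up.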
That is the interesting part for creators. The pitch is not offline cleanup after a render. The pitch is footage that keeps playing while the edit model keeps up.
Causal pipeline
According to the project page, SANA-Streaming uses a Hybrid Diffusion Transformer that mixes Gated DeltaNet blocks for compact global memory with softmax-attention blocks for local source alignment. The paper says Cycle-Reverse Regularization trains the model to reconstruct source frames from edited output, which is how NVIDIA tries to keep long clips from drifting.
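For readers wondering what "compact global memory" means in practice, the sketch below shows the generic gated delta rule that Gated DeltaNet layers are built on, written in plain NumPy. It is an illustration of the published recurrence under simplifying assumptions (a single head, scalar gates, unit-normalized keys), not NVIDIA's implementation, and the function names are hypothetical.

```python
import numpy as np


def gated_delta_step(state, key, value, alpha, beta):
    """One update of a gated delta-rule memory.

    state: (d_v, d_k) matrix, fixed size regardless of sequence length.
    alpha: decay gate in [0, 1]; beta: write strength in [0, 1].
    Implements S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T.
    """
    key = key / np.linalg.norm(key)
    # Decay the memory and erase whatever was previously stored under this key.
    state = alpha * (state - beta * np.outer(state @ key, key))
    # Write the new value under the same key.
    return state + beta * np.outer(value, key)


def read_memory(state, query):
    """Retrieve the value associated with a query vector."""
    return state @ query


rng = np.random.default_rng(0)
d_k, d_v = 8, 4
state = np.zeros((d_v, d_k))

key = rng.normal(size=d_k)
unit_key = key / np.linalg.norm(key)
first_value = rng.normal(size=d_v)
second_value = rng.normal(size=d_v)

# Write one value, then overwrite it under the same key.
state = gated_delta_step(state, key, first_value, alpha=1.0, beta=1.0)
state = gated_delta_step(state, key, second_value, alpha=1.0, beta=1.0)

recalled = read_memory(state, unit_key)
print(np.allclose(recalled, second_value))
print(state.shape)
```

The point of the toy is that the memory stays a fixed-size matrix however many frames pass through it, and a new write replaces the old association instead of piling on top of it. That is the property that makes this kind of block attractive for minute-long streams, where full softmax attention over every past frame would grow with clip length.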
The same page says a causal VAE decoder handles streaming generation and decoding. That matters because the release is explicitly about preserving source motion and non-edited content, not generating a brand-new clip from scratch.
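"Causal" here has a concrete, checkable meaning: the output for a given frame can depend on that frame and earlier ones, never on later ones. The toy below demonstrates that property with a deliberately trivial stand-in editor (a running average blended into each frame). Nothing about it reflects SANA-Streaming's actual model or VAE; the `stream_edit` function and its `decay` parameter are invented for the illustration.

```python
import numpy as np


def stream_edit(frames, decay=0.9):
    """Toy causal editor: blends each frame with a running summary of the past.

    The state is updated only from frames already seen, so outputs can be
    emitted one at a time while the source keeps playing.
    """
    state = np.zeros_like(frames[0], dtype=float)
    for frame in frames:
        state = decay * state + (1.0 - decay) * frame
        yield 0.5 * frame + 0.5 * state


rng = np.random.default_rng(0)
clip = rng.random((6, 4, 4))          # six tiny 4x4 "frames"
altered = clip.copy()
altered[4:] = rng.random((2, 4, 4))   # change only the last two frames

out_original = np.stack(list(stream_edit(clip)))
out_altered = np.stack(list(stream_edit(altered)))

# Outputs for the first four frames are identical: the future never leaks back.
causal_prefix_matches = np.allclose(out_original[:4], out_altered[:4])
# Outputs from the changed frames onward do differ.
later_frames_differ = not np.allclose(out_original[4:], out_altered[4:])

print(causal_prefix_matches, later_frames_differ)
```

A non-causal editor that looks at the whole clip before producing frame one would fail the first check, which is why it could not run on live footage.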
Stacked edits
One of the more useful demos is minchoi's one-source-multiple-edits clip, which applies two different edits to the same source video: a background swap to a speakeasy lounge, then a wardrobe change to a burgundy velvet smoking jacket.
That same thread also ends in a robotics example where hands and forearms become articulated robot parts while the tool use stays intact (minchoi's robotics demo). Between the stacked-edit clip and the robotics pass, NVIDIA is quietly showing that the system is built for selective rewrites inside an existing performance, not just whole-frame restyling.