Release, June 25, 2026

Wan-Streamer v0.1 opens real-time video agents with live voice demos

Wan-Streamer v0.1 surfaced with paper links and demos showing real-time video conversations, live recording, and spoken avatar responses. That matters for interactive characters and live creator tools because multimodal generation moves from rendered clips to low-latency back-and-forth.


TL;DR

_akhaliq's paper post surfaced Wan-Streamer v0.1 as an "end-to-end real-time interactive foundation model," and minchoi's link post tied that to a project page and Hugging Face paper.
The clearest product reveal in minchoi's real-time recording demo is a live loop with video, voice, and text running at once, which pushes the format past one-shot avatar clips.
Two conversation demos, minchoi's first agent demo and minchoi's second agent demo, show different voices, facial behavior, and call-style framing instead of a fixed talking head.
Reaction posts split between hype and pragmatism: minchoi's thread opener framed it as a jump from voice mode to video agents, while LLMJunky's response called it a fresh real-time voice-agent format with a strong "Jarvis" feel.

The primary sources are the project page, the Hugging Face paper, a live recording demo, and two avatar calls, the second of which changes both scene tone and voice style.

Real-time recording

The most concrete capability here is simultaneous live video, live speech, and live text. minchoi's real-time recording post says that plainly, and the attached demo shows the interface behaving like an interactive session instead of a rendered output queue.

Wan-Streamer live recording demo

That matters because low-latency multimodal tools usually surface as separate features: speech in one tab, avatar animation in another, video generation somewhere else. Wan-Streamer is pitching the whole loop as one system.

Avatar calls

The two agent demos show the format better than any summary could. minchoi's first agent demo describes a cheerful female voice in a bright room, while minchoi's second agent demo switches to a clear male voice in a warmer call setup.

Across the clips, the notable pattern is variation:

different voice presets
different scene framing
conversational back-and-forth instead of monologue
expression and timing tuned to spoken replies

That puts the demos closer to live character interaction than to the usual AI presenter video.

Early read from creators

The immediate reaction layer is small, but it is pointed. minchoi's thread opener sold the demos as "not voice mode anymore," and LLMJunky's reaction said the format "really nails the Jarvis vibe."

For creative readers, the interesting part is the interface grammar already visible in the evidence:

a face that listens while the user speaks
spoken replies with synced expression changes
call-style presentation instead of a chat box alone
enough responsiveness to make the demo feel performative, not batch-generated

Project page and paper

The source trail is unusually direct for an early demo cycle. minchoi's link post points to both the project page and the Hugging Face paper, while _akhaliq's earlier post had already surfaced the paper itself.

That gives this one a better paper-to-demo chain than most viral avatar clips. The public materials were part of the reveal from the start, not something community sleuths had to reconstruct later.