Seedance 2.0 supports multi-speaker lip sync in live-action and animation
Curious Refuge posted tests showing Seedance 2.0 syncing multiple speakers from a reference image plus a blacked-out video or audio track, driven by shot-by-shot dialogue prompts. The workflow moves Seedance closer to directed dialogue scenes, but prompt wording and voice guidance still affect stability.

TL;DR
- Curious Refuge says Seedance 2.0 can lip-sync more than one speaker from a small input stack: a reference image, a blacked-out video with audio or a standalone audio file, plus a prompt that spells out camera moves and dialogue, per CuriousRefuge's workflow post and multi-speaker demo.
- The strongest reveal in CuriousRefuge's car-scene test, the couch-scene follow-up, and the coffee-date example is not just mouth movement, but shot-by-shot dialogue staging across cuts.
- Prompted voice-over appears to be a control lever, because CuriousRefuge's original note says the prompted VO helps keep the model from hallucinating or riffing away from the original speaker.
- ByteDance's official launch post says Seedance 2.0 is built for mixed text, image, audio, and video input, while Dreamina's official tool page explicitly says the product supports voice-style guidance and lip-sync.
You can read ByteDance's launch post, check Dreamina's lip-sync and voice-guidance FAQ, and even see that the same model is already exposed on Replicate, whose readme spells out the mixed-input limits. On the evidence side, the noir car scene, the couch argument, and the awkward coffee date all use the same basic recipe but push it into multi-shot dialogue instead of the usual single talking head.
The input recipe
Curious Refuge reduced the workflow to three inputs in its original post: a reference image, a blacked-out video plus audio or just an audio file, and a prompt that describes camera motion and voice-over.
That lines up with ByteDance's official launch post, which says Seedance 2.0 accepts text, image, audio, and video in the same generation pipeline, and with the Replicate readme, which says a single run can combine up to 9 images, 3 video clips, and 3 audio files.
The interesting part is how little structure the creator-side workflow needs. The tweet evidence does not show rigging, phoneme controls, or a separate avatar editor. It shows a directed prompt sitting on top of a multimodal video model.
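For orientation, here is roughly what that recipe could look like as an API call. The sketch below uses the Replicate Python client; the model slug, the input field names, and the dialogue are placeholders assumed for illustration, not the actual schema, so the model's Replicate readme is the place to check real parameter names and the 9-image / 3-video / 3-audio limits.

```python
# Hedged sketch: a mixed-input Seedance run via the Replicate Python client.
# The model slug and input field names are assumed placeholders; the real
# schema lives in the model's Replicate readme.
import replicate

output = replicate.run(
    "bytedance/seedance-2.0",  # hypothetical model slug
    input={
        # Directed prompt: camera moves plus line-by-line dialogue (invented here).
        "prompt": (
            "Wide two-shot of two people in a parked car at night. "
            "Cut to medium on the driver: 'You were supposed to call.' "
            "Close-up on the passenger, beat of silence: 'I know.'"
        ),
        "image": open("reference_frame.png", "rb"),  # assumed field name
        "audio": open("dialogue_track.mp3", "rb"),   # assumed field name
    },
)
print(output)  # URL(s) to the generated clip(s)
```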
Shot lists as direction
The prompts in the first multi-speaker example, the second, and the third read more like a shot list than a speech transcript.
Each prompt specifies:
- starting composition, such as a wide two-shot or medium two-shot
- camera changes, such as cut to medium, close-up, or slight push-in
- line-by-line dialogue
- reaction beats, pauses, and awkward silences
- an ending frame for the scene
That matches Seedance 2.0's official framing as a model for controllable multi-shot video. ByteDance's paper says the system targets multi-shot narrative capability, facial-expression detail, and cross-frame consistency, which is exactly the part these tests are poking at.
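To make that structure concrete, here is an invented prompt in the same shot-list style, wrapped as a Python string to match the sketch above. It is an illustration of the pattern, not one of Curious Refuge's actual prompts.

```python
# Invented shot-list prompt, illustrating the structure only --
# not a verbatim Curious Refuge prompt.
prompt = """
Wide two-shot: MAYA and TOM on a worn couch, TV glow on their faces.
MAYA (flat): "You ate the last slice."
Cut to medium on TOM. He keeps watching the TV, says nothing for a beat.
TOM: "There were two left an hour ago."
Close-up on MAYA, slight push-in, long awkward silence.
MAYA: "There was one."
End frame: wide two-shot, both staring at the TV, neither speaking.
"""
```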
Multi-speaker scenes, not just avatars
Most lip-sync demos stop at one face speaking to camera. Curious Refuge pushed the workflow into dialogue scenes with multiple speakers and moving coverage.
The three examples shown in the thread are notably different:
- The car scene uses noir-style coverage through multiple interior and exterior angles.
- The couch scene tests comedic timing, eye contact, and deadpan reaction shots.
- The coffee-date scene leans on micro-expression shifts, where the line lands and the face has to change with it.
That is a more useful benchmark for filmmakers and brand teams than a single front-facing avatar clip, because scene grammar breaks these systems faster than a monologue does.
Prompted VO is doing real control work
Curious Refuge's most concrete workflow note lives in the original tweet, where it says the prompted VO is important if you want to stop the model from hallucinating or drifting from the source speaker.
That detail fits Dreamina's official Seedance page, which says users can provide a voice sample for style guidance and lip-sync. The model is not just matching mouth shapes to sound. It is being steered by explicit audio reference plus text instructions about what the scene should sound and look like.
The same reply thread adds one more practical detail: when another user asked which platform was being used, CuriousRefuge answered "Freepik". The tweets do not explain how Freepik exposes Seedance 2.0 under the hood, but they do show the workflow is already escaping the official demo surface.
Access surfaces
ByteDance's product page positions Seedance 2.0 as a unified audio-video model for text, image, audio, and video input, and Dreamina's tool page says the model is in early access with VIP users able to try it for free.
That matters because the thread sits in a broader creator-tool moment. On the same day Curious Refuge was posting Seedance tests, it was also promoting a Luma Agents workshop centered on planning, generating, iterating, and refining across image, video, audio, and text. The overlap is the real tell: multi-speaker lip-sync is getting folded into larger AI production workflows, not treated as a novelty feature bolted onto avatar apps.