Tongyi Lab has open-sourced Fun-CineForge, with multi-speaker dubbing, a temporal modality for off-screen or blocked faces, and a full dataset-building pipeline. It matters for dialogue and localization workflows that break on hard cuts, overlapping speech, or missing lip cues.

Fun-CineForge is more than a dubbing checkpoint. The project thread points to both a GitHub repo and a project page, and says the release includes an end-to-end production pipeline for building dubbing datasets from raw video. That matters because the same thread says existing dubbing data is often small, error-prone, expensive to label, and skewed toward monologues rather than conversations.
On the modeling side, the release centers on “temporal modality.” In Tongyi Lab’s framing, the model is meant to infer speaker identity, timing, and rhythm even when faces are missing, which is the gap that usually breaks lip-driven dubbing systems. The launch materials position it for narration, monologues, and multi-speaker dialogue rather than a single narrow use case.
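To make the idea concrete, here is a minimal sketch of the kind of speaker-timing signal a temporal modality could carry alongside video frames. The segment structure and function names are illustrative assumptions, not Fun-CineForge's actual interface; the point is that who-is-speaking-when can be answered from a timeline rather than from lip cues.

```python
from dataclasses import dataclass

@dataclass
class SpeechSegment:
    speaker: str   # who is speaking (hypothetical label)
    start: float   # segment start, seconds
    end: float     # segment end, seconds

def active_speakers(timeline, t):
    """Return speakers active at time t, with no reliance on visible lips.

    Overlapping segments model several people talking in one scene."""
    return sorted(s.speaker for s in timeline if s.start <= t < s.end)

# Toy timeline: narration with no on-screen face, an overlap, and a
# speaker whose face is blocked for the whole shot.
timeline = [
    SpeechSegment("narrator", 0.0, 4.0),
    SpeechSegment("alice", 3.0, 6.0),   # overlaps the narrator
    SpeechSegment("bob", 6.5, 9.0),     # face blocked in this shot
]

print(active_speakers(timeline, 3.5))  # narrator and alice overlap here
```

A lip-driven system has nothing to work with at t = 6.5 if bob's face is occluded; a timeline like this still resolves the speaker.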
The creative use case is not a basic voice swap. The launch demo argues that multi-speaker dubbing gets hard when a character is off-screen, the camera cuts fast, faces are blocked or blurred, or several people talk in the same scene. Those are common conditions in film scenes, trailers, interviews, and localized social video, where editors cannot depend on clean frontal lip cues.
The second practical angle is data creation. Tongyi Lab says the pipeline automates raw-video-to-annotation prep, so teams can assemble their own multimodal dubbing datasets instead of waiting for a clean public corpus. For creators working on dialogue localization, that makes this release look as much like infrastructure as a model drop.
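The shape of such a pipeline can be sketched as a chain of stages: shot detection, per-shot speaker diarization, then transcription and alignment into annotation records. Every stage name, record field, and placeholder below is an assumption for illustration; the released pipeline's actual stages and API may differ.

```python
def detect_shots(video_path):
    # Placeholder: a real pipeline would run scene-cut detection here.
    return [{"start": 0.0, "end": 5.0}, {"start": 5.0, "end": 11.0}]

def diarize(video_path, shot):
    # Placeholder: a real pipeline would run speaker diarization per shot.
    return [{"speaker": "spk0", "start": shot["start"], "end": shot["end"]}]

def transcribe(video_path, segment):
    # Placeholder: a real pipeline would run ASR and forced alignment.
    return "<transcript>"

def build_dataset(video_path):
    """Chain the stages into dubbing-ready annotation records."""
    records = []
    for shot in detect_shots(video_path):
        for seg in diarize(video_path, shot):
            records.append({
                "video": video_path,
                "speaker": seg["speaker"],
                "start": seg["start"],
                "end": seg["end"],
                "text": transcribe(video_path, seg),
            })
    return records

records = build_dataset("raw_clip.mp4")
print(len(records), "annotation records")
```

The value of automating this chain is exactly the complaint in the launch thread: hand-labeling these records is slow, error-prone, and tends to produce monologue-heavy corpora.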
Links from the launch thread: the GitHub repo at github.com/FunAudioLLM/Fu… and the Fun-CineForge homepage at funcineforge.github.io.