A launch thread says Qwen3.5-Omni can turn whiteboards and gameplay videos into code, generate timestamped audiovisual logs, and run realtime voice features with interruption handling. The linked materials also cite 256K context, long audio and video windows, and API access for offline and realtime modes.

Qwen3.5-Omni is being pitched as a multimodal production model, not just a chatbot. In the launch thread, Hasan Toor says it can infer code directly from visual and audiovisual input, with demos framed around whiteboard-to-website and gameplay-video-to-code workflows (launch demo). The linked release materials also point readers to a browser version on Qwen Chat, a report page, and separate offline and realtime API access points.
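The release materials linked in the thread don't include request examples, so the API shape is unverified. As a minimal sketch, assuming the offline endpoint follows the OpenAI-compatible chat format earlier Qwen releases exposed, a video-to-code request might be assembled like this (the model name, field names, and video content type are all assumptions, not confirmed by the launch materials):

```python
# Sketch of an offline-mode request body, assuming an OpenAI-compatible
# chat format. The model identifier, the "video_url" content type, and
# the message shape are assumptions for illustration only.

def build_video_to_code_request(video_url: str, instruction: str) -> dict:
    """Assemble a chat-style payload asking the model to infer code from video."""
    return {
        "model": "qwen3.5-omni",  # hypothetical model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": instruction},
                ],
            }
        ],
    }

payload = build_video_to_code_request(
    "https://example.com/whiteboard.mp4",
    "Turn this whiteboard sketch into a single-page website.",
)
```

The same payload shape would presumably serve the gameplay-to-code demo by swapping the video URL and instruction; only hands-on testing against the real endpoint would confirm the field names.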
The technical claims are broad. According to the thread's specs post, Qwen3.5-Omni has a 256K-token context window, supports up to 10 hours of audio and 1 hour of video in one pass, recognizes speech in 74 languages, and generates it in 29. Those numbers matter most for long-form creative review: recorded workshops, edit sessions, reference reels, and extended voice interactions are exactly the kinds of material creators actually work with.
The clearest creator-facing workflow change is logging and breakdown. The captioning demo claims Qwen3.5-Omni can output frame-accurate timestamps, slice scenes automatically, map characters to audio, and follow custom instructions, turning raw footage into something closer to an edit log or script draft than a plain summary (captioning demo). For filmmakers, documentarians, and social teams, that suggests less manual note-taking between ingest and first cut.
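The thread doesn't show the caption output format, but if the model emits timestamped scene lines, turning them into a structured edit log is a small parsing step on the creator's side. A minimal sketch, assuming a hypothetical `HH:MM:SS.mmm --> HH:MM:SS.mmm  description` line format (borrowed from subtitle-style cue timing; the real output format is unknown):

```python
import re
from dataclasses import dataclass

@dataclass
class SceneEntry:
    start_s: float  # scene start, in seconds
    end_s: float    # scene end, in seconds
    note: str       # caption text for the scene

# Hypothetical cue line: "00:01:02.500 --> 00:01:09.000  Character A enters frame"
LINE = re.compile(
    r"(\d+):(\d+):(\d+)\.(\d+)\s*-->\s*(\d+):(\d+):(\d+)\.(\d+)\s+(.*)"
)

def to_seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_edit_log(caption_text: str) -> list[SceneEntry]:
    """Collect timestamped caption lines into edit-log entries; skip anything else."""
    entries = []
    for line in caption_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            g = match.groups()
            entries.append(SceneEntry(to_seconds(*g[0:4]), to_seconds(*g[4:8]), g[8]))
    return entries

log = parse_edit_log(
    "00:00:00.000 --> 00:00:12.500  Opening wide shot\n"
    "00:00:12.500 --> 00:00:30.000  Character A speaks"
)
# log[0].end_s -> 12.5
```

In a real pipeline the entries would feed an NLE marker list or a rough-cut EDL; the point is that frame-accurate, machine-readable timestamps are what make the "raw footage to edit log" claim usable, not the prose summary around them.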
The realtime features push in a different direction: live performance and voice interfaces. The thread claims adjustable emotion, speed, and volume, one-sample voice cloning, web search, tool calling, and “semantic interruption” for more natural turn-taking (realtime demo). If those demos hold up outside the launch thread, Qwen3.5-Omni is less interesting as a general assistant than as a multimodal layer for prototyping narrated apps, voiced characters, and hands-free creative tools.
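The thread doesn't explain how "semantic interruption" works internally. A plausible mental model is a small gate that cuts off playback only when incoming audio is both loud enough and classified as directed speech, rather than a filler or background noise. A toy sketch of that gating logic, with illustrative thresholds and filler list (none of this reflects Qwen's actual implementation):

```python
# Toy model of semantic interruption: barge in on playback only when the
# user's utterance is loud enough AND looks like a real request, not a
# filler ("uh", "hmm") or background noise. All thresholds are illustrative.

FILLERS = {"uh", "um", "hmm", "er"}

def should_interrupt(transcript: str, energy: float,
                     energy_threshold: float = 0.3) -> bool:
    if energy < energy_threshold:          # too quiet: background noise
        return False
    words = transcript.lower().split()
    if not words:                          # no recognized speech at all
        return False
    if all(w in FILLERS for w in words):   # user is thinking aloud
        return False
    return True                            # directed speech: interrupt

# should_interrupt("hmm", 0.8) -> False (thinking aloud, keep talking)
# should_interrupt("stop, change the ending", 0.8) -> True (barge in)
```

Whatever the real mechanism is, the distinction the demo draws between "you're thinking" and "you're interrupting" is the difference between a voice interface that feels conversational and one that clams up at every cough.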
🚨 BREAKING: Qwen3.5-Omni just dropped a mind-blowing emergent ability: Audio-Visual Vibe Coding. No specific training. Just raw power.
→ Turn whiteboard brainstorming videos into fully functional websites
→ Turn gameplay screen recordings into playable code
Generic video summaries are dead. Qwen3.5-Omni delivers script-level audio-visual captioning, fully customizable:
Frame-accurate timestamps
Automatic scene slicing
Character + audio mapping
Follows your exact instructions
Raw footage → production-ready logs in seconds.
Real-time Mode feels scary human:
Instantly control volume, speed, emotion
Voice cloning from just one sample
Smart semantic interruption knows when you’re thinking vs. background noise
Built-in web search + complex tool calling
Natural turn-taking. Real conversation rhythm.