Alibaba launched Qwen3.5-Omni in Lite, Flash, Plus, and Plus-Realtime variants for native text, image, audio, and video understanding. The family stretches multimodal context to 10 hours of audio or 400 seconds of 720p video and adds real-time voice controls plus camera-to-code demos.

The fun part is that the official launch page leads with camera-to-code, while the more useful engineering details are buried in the realtime docs, which specify WebSocket transport and 120-minute sessions, and in the API docs, which describe a more constrained request shape. The demos make Qwen3.5-Omni look like one seamless omni model. The docs show a product family with separate interaction modes, and that distinction is what engineers should pay attention to.
This is the launch's headline feature for a reason. Qwen is showing a user speaking to a camera, mixing visual context with natural language, and getting a functional site or game back. That is a more concrete direction for multimodal coding than the usual screenshot-to-HTML demos, because the input channel is live audio plus video, not a static artifact.
The Hugging Face online demo suggests Alibaba wants people to treat this as an interactive interface, not just a research preview. If the model holds up outside canned prompts, the real story here is that multimodal coding UX is shifting from "upload and ask" to "show and narrate."
The context claim is huge: up to 10 hours of audio or 400 seconds of 720p video, training on more than 100 million hours of data, plus speech recognition across 113 languages and speech generation in 36 (launch thread). Even if most teams never hit those limits, long native audio and video context changes what you can build around call review, meeting analysis, tutoring, surveillance review, and media ops.
The benchmark table in the launch materials matters more for its shape than for any single score. Qwen3.5-Omni-Plus posts strong numbers on VoiceBench, Fleurs S2TT, and several audio-visual tasks, while the slide also shows where Gemini still leads, such as WorldSense and some text-heavy evals (benchmark recap). That makes this feel less like a clean knockout and more like a serious catch-up release in the multimodal stack's hardest modalities.
The polished demos point at a real product surface, and the docs fill in the engineering details. According to Alibaba's Qwen-Omni-Realtime documentation, the realtime model accepts streaming audio and continuous image frames, returns text and audio in real time, uses WebSocket, and keeps a single session alive for up to 120 minutes.
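To make that concrete, here is a minimal sketch of what a client for that kind of session tends to look like: one long-lived WebSocket, audio pushed as it is captured, incremental events read back. The endpoint URL, event names, and message fields below are illustrative placeholders, not Alibaba's actual wire protocol; the real schema lives in the Qwen-Omni-Realtime docs.

```python
# Sketch of a realtime session over WebSocket, matching the shape the realtime docs
# describe (streaming audio in, text/audio out, one long-lived connection).
# Endpoint, event names, and fields are hypothetical placeholders, not the real protocol.
import asyncio
import base64
import json

import websockets  # pip install websockets

REALTIME_URL = "wss://example.invalid/qwen-omni-realtime"  # placeholder endpoint

async def stream_session(audio_chunks):
    # Authentication headers omitted here; the real endpoint requires an API key.
    async with websockets.connect(REALTIME_URL) as ws:
        # Push microphone audio (and optionally camera frames) as it arrives.
        for chunk in audio_chunks:
            await ws.send(json.dumps({
                "type": "input_audio.append",            # hypothetical event name
                "audio": base64.b64encode(chunk).decode("ascii"),
            }))
        await ws.send(json.dumps({"type": "input_audio.commit"}))  # hypothetical

        # Read incremental text/audio events until the server signals completion.
        while True:
            event = json.loads(await ws.recv())
            if event.get("type") == "response.done":     # hypothetical terminal event
                break
            print(event)

asyncio.run(stream_session(audio_chunks=[b"\x00" * 3200]))  # ~100 ms of 16 kHz PCM
```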
That setup is a meaningful design choice. WebSocket sessions with long lifetimes make Qwen3.5-Omni-Plus-Realtime look suited for persistent assistants, live copilots, and kiosk-style agents, not just one-shot voice chat. The standard Qwen-Omni API docs are narrower, describing OpenAI-compatible invocation where requests combine text with one additional modality and return streaming text or speech. Engineers evaluating the release should read those two docs side by side before assuming feature parity across the family.
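For contrast, the standard API's constrained shape looks roughly like an ordinary OpenAI-compatible chat call: text plus one extra modality in, streamed output back. The base URL and model id below are assumptions for illustration; take the exact values from Alibaba Cloud's API docs.

```python
# Sketch of the narrower request shape described in the standard Qwen-Omni API docs:
# OpenAI-compatible chat completions, text plus one additional modality, streamed text out.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # hypothetical model id based on the family naming
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/wireframe.png"}},
            {"type": "text", "text": "Turn this wireframe into a single-file HTML page."},
        ],
    }],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
```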
Qwen's realtime pitch is not just lower latency. The demos show interruption handling, multi-turn dialogue, and direct control over style, emotion, pace, and volume. Those are product features, but they also imply a model and inference stack tuned for live correction, partial intent, and noisy environments (interruption demo, voice control demo).
This matters because most voice agents still feel brittle at the exact moment a user stops speaking clearly or changes their mind mid-sentence. Qwen is betting that better turn-taking and controllable speech are as important as raw reasoning quality for realtime assistants. The realtime docs also list built-in support for web search and complex function calling, which is the combination you need for voice agents that act instead of just narrating.
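If the function-calling support works the way OpenAI-compatible tool use normally does, wiring an action into a voice agent looks something like the sketch below. The tool name, its parameters, and the dispatch are hypothetical examples; only the generic tools/tool_calls plumbing is the standard OpenAI-compatible pattern, and the model id is the same assumption as above.

```python
# Sketch: registering a callable tool so a voice agent can act rather than narrate.
# Tool name, schema, and dispatch are hypothetical; base_url and model id are assumptions.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

tools = [{
    "type": "function",
    "function": {
        "name": "set_thermostat",  # hypothetical action a voice agent might take
        "description": "Set the target temperature in a named room.",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string"},
                "celsius": {"type": "number"},
            },
            "required": ["room", "celsius"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # hypothetical model id
    messages=[{"role": "user", "content": "Make the bedroom a bit warmer, say 22 degrees."}],
    tools=tools,
)
# The model returns structured tool calls; your own code executes them.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```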
The most important caveat is operational, not benchmark-related. Early community reaction flagged that Qwen3.5-Omni is not open weight (closed-weight reaction), and the official path Alibaba is pushing is cloud access through Qwen Chat, Hugging Face demos, and Alibaba Cloud APIs (Qwen Chat, offline demo, API docs).
That limits how far this launch travels into local-first and regulated deployments, even though the model family looks technically ambitious. For many teams, Qwen3.5-Omni is best read as a strong signal about where multimodal interfaces are heading, especially camera-guided coding and persistent realtime assistants, and less as an immediately portable foundation model you can drop into your own stack.
From the launch thread: "🚀 Qwen3.5-Omni is here! Scaling up to a native omni-modal AGI. Meet the next generation of Qwen, designed for native text, image, audio, and video understanding, with major advances in both intelligence and real-time interaction. A standout feature: 'Audio-Visual Vibe Coding'."

The thread's demo posts cover Audio-Visual Vibe Coding, Travel Planning, Multi-Turn Dialogue and Intelligent Interruption, and Voice Style, Emotion and Volume Control, and announce the four variants: Qwen3.5-Omni-Lite, Qwen3.5-Omni-Flash, Qwen3.5-Omni-Plus, and Qwen3.5-Omni-Plus-Realtime.