Qwen releases Qwen3.5-Omni with 10-hour audio and 400s video support
Alibaba launched Qwen3.5-Omni across Lite, Flash, Plus, and Plus-Realtime variants for native text, image, audio, and video understanding, plus realtime voice controls and script-level captioning. The family targets long multimodal sessions and live interaction, but its focus is understanding, so look elsewhere if you need media generation.

TL;DR
- Alibaba launched Qwen3.5-Omni as a multimodal family spanning Lite, Flash, Plus, and Plus-Realtime, with native text, image, audio, and video understanding rather than a single demo-only model Launch thread Variant list.
- The headline capacity claims are unusually large for a live multimodal system: up to 10 hours of audio, 400 seconds of 720p video, training on more than 100 million hours of data, speech recognition across 113 languages, and speech generation in 36 languages Launch thread.
- Qwen is pushing “Audio-Visual Vibe Coding” as the signature trick, showing the model turn spoken instructions plus camera input into working websites or simple games in realtime demos Vibe coding demo Second vibe coding demo.
- The realtime stack is broader than voice chat. Qwen is also showing interruption handling, turn-taking, voice emotion and volume control, built-in web search, and complex function calling Launch thread Interruption demo Voice control demo.
- The caveat is in the name. Several community reactions noted that “omni” here is mostly about understanding and speech interaction, not open-weight media generation, and the family appears to be API and demo accessible rather than openly released Understanding caveat Not open weight note.
You can read the official launch post, jump straight into Qwen Chat, inspect the offline demo on Hugging Face, and compare that with the online demo. The developer angle is split across two Alibaba Cloud docs, one for the multimodal API and one for the realtime WebSocket API. The weirdest product framing is still the launch video genre Qwen chose to lead with: multimodal coding by talking at a camera.
Variants and envelope
Qwen positioned this as a family release, not a flagship-only drop. The variant list circulating on launch day names Qwen3.5-Omni-Lite, Flash, Plus, and Plus-Realtime, which lines up with the official launch messaging around separate offline and realtime use cases.
The raw envelope is the part engineers will remember. Qwen says Plus can natively handle up to 10 hours of audio or 400 seconds of 720p video, and that the speech stack recognizes 113 languages while speaking 36. Those are system-shape claims as much as benchmark claims, because they hint at where Qwen thinks long-session multimodal interaction is headed.
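To make the envelope concrete, here is a minimal pre-flight sketch that checks a clip against the quoted limits and cuts an over-long recording into windows. The `MediaInfo` type and both helper names are hypothetical, and treating "720p" as a hard height cap is an assumption, not something the launch materials state.

```python
from dataclasses import dataclass

# Limits quoted in the launch materials for Qwen3.5-Omni-Plus.
MAX_AUDIO_SECONDS = 10 * 60 * 60   # 10 hours = 36,000 seconds
MAX_VIDEO_SECONDS = 400            # 400 seconds
MAX_VIDEO_HEIGHT = 720             # assumption: "720p" read as a height cap


@dataclass
class MediaInfo:
    """Hypothetical clip description; not an Alibaba Cloud type."""
    kind: str                # "audio" or "video"
    duration_seconds: float
    height: int = 0          # pixel height, only meaningful for video


def fits_envelope(media: MediaInfo) -> bool:
    """Return True if the clip is within the quoted limits."""
    if media.kind == "audio":
        return media.duration_seconds <= MAX_AUDIO_SECONDS
    if media.kind == "video":
        return (media.duration_seconds <= MAX_VIDEO_SECONDS
                and media.height <= MAX_VIDEO_HEIGHT)
    raise ValueError(f"unsupported media kind: {media.kind!r}")


def split_windows(duration_seconds: float, limit_seconds: float):
    """Cut a long recording into consecutive (start, end) windows,
    each no longer than limit_seconds."""
    windows = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + limit_seconds, duration_seconds)
        windows.append((start, end))
        start = end
    return windows
```

A 1,000-second video, for example, would need three windows under the 400-second limit, while a full 10-hour audio file fits in a single request if the quoted limit holds.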
Audio-Visual Vibe Coding
Qwen made “Audio-Visual Vibe Coding” the hero feature, and the demos are specific enough to matter. The pitch is that a user describes a UI or game idea verbally while pointing the camera, and Qwen3.5-Omni-Plus generates a functional artifact instead of just returning a design description.
A second demo matters because it makes this look less like a one-off canned flow. Qwen shows another spoken, camera-guided build sequence that resolves into a working website, with code and rendered output appearing together.
That is Christmas-come-early marketing for agent builders, but it is also narrower than the name suggests. The launch materials consistently frame the model as understanding text, images, audio, and video natively, then using that understanding for coding, search, or function calls, not as a general media generator Understanding caveat.
Realtime conversation controls
The strongest realtime demos are not the coding ones. They are the interaction controls.
Qwen shows at least four distinct realtime behaviors:
- Multi-turn dialogue that preserves conversational state across several exchanges Interruption demo
- Intelligent interruption, where the model reacts to user interjections instead of waiting for a full turn boundary Interruption demo
- Fine-grained control over voice style, emotion, pace, and volume Voice control demo
- Built-in web search and complex function calling Launch thread
There is also a separate travel-planning demo, which suggests Qwen wants this read as a general voice agent platform, not just a flashy multimodal assistant Travel planning demo.
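The interruption behavior listed above happens on the model side, but a client has to cooperate by stopping playback when the user cuts in. The following is a minimal sketch of that client-side bookkeeping. The class, its callback names, and the queue are all hypothetical and are not taken from Qwen's documentation.

```python
from enum import Enum


class Turn(Enum):
    LISTENING = "listening"   # waiting for, or receiving, user speech
    SPEAKING = "speaking"     # model audio is queued or playing


class BargeInController:
    """Hypothetical client-side turn tracker for a voice session."""

    def __init__(self):
        self.state = Turn.LISTENING
        self.playback_queue = []   # model audio chunks not yet played

    def on_model_audio(self, chunk: bytes) -> None:
        # Any incoming model audio means the model holds the turn.
        self.state = Turn.SPEAKING
        self.playback_queue.append(chunk)

    def on_playback_finished(self) -> None:
        # Normal turn boundary: the model finished without interruption.
        self.playback_queue.clear()
        self.state = Turn.LISTENING

    def on_user_speech_detected(self) -> bool:
        """Handle user speech; return True if it interrupted the model."""
        interrupted = self.state is Turn.SPEAKING
        if interrupted:
            self.playback_queue.clear()   # drop audio the user talked over
        self.state = Turn.LISTENING
        return interrupted
```

The point of the sketch is the asymmetry: user speech during `SPEAKING` discards pending audio immediately instead of waiting for a full turn boundary, which is the behavior the demo emphasizes.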
Benchmarks and caveats
The benchmark image in the launch thread is dense, but the pattern is clear. Qwen3.5-Omni-Plus posts stronger numbers than Gemini-3.1 Pro on several audio tasks, including LibriSpeech-other, MIR-1K, VoiceBench, RUL-MuchoMusic, and MMAU, while roughly trading blows on audio-visual understanding depending on the benchmark.
Three details stand out from the table and reposts:
- Qwen claims an edge over Gemini-3.1 Pro on DailyOmni, QualcommInteractive, and Omni-Cloze Launch benchmarks Benchmark repost
- Gemini still leads on some audio-visual and text rows, including WorldSense, IFEval, and MMLU-Redux Launch benchmarks
- The speech-generation rows compare Qwen against Gemini-2.5-Pro-TTS, which makes the launch feel like a stitched system comparison across multiple Gemini variants, not a clean one-model duel Launch benchmarks Community caveat
The other caveat is access. Public demos are live, but the launch reads like hosted availability through Qwen Chat, Hugging Face spaces, and Alibaba Cloud endpoints, not an open-weight release Not open weight note.
API surface
The developer docs add one concrete implementation detail the social posts barely mention. Qwen-Omni-Realtime uses WebSocket rather than ordinary request-response HTTP, accepts streaming audio plus continuous image frames, and returns text and audio in realtime through region-specific endpoints for Beijing and Singapore realtime API docs.
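Sending audio and image frames over one socket means the client has to interleave two timed sources into a single ordered stream. The sketch below shows one way to do that with a timestamp merge. The message shape and the `example.*` event names are placeholders invented for illustration, not the documented Qwen-Omni-Realtime schema, so check the realtime API doc for the real event types.

```python
import base64
import heapq
import json
from typing import Iterable, Iterator, Tuple

# (timestamp_ms, raw_bytes); each source is assumed sorted by timestamp.
Timed = Tuple[int, bytes]


def _tag(kind: str, items: Iterable[Timed]) -> Iterator[Tuple[int, str, bytes]]:
    """Attach a kind label so merged items remember their source."""
    for ts, payload in items:
        yield ts, kind, payload


def multiplex(audio: Iterable[Timed], frames: Iterable[Timed]) -> Iterator[str]:
    """Merge audio chunks and image frames into one timestamp-ordered
    stream of JSON text messages, ready to pass to a WebSocket send call."""
    merged = heapq.merge(_tag("audio", audio), _tag("image", frames))
    for ts, kind, payload in merged:
        yield json.dumps({
            # Placeholder event name, NOT the documented schema.
            "type": f"example.{kind}.append",
            "timestamp_ms": ts,
            "data": base64.b64encode(payload).decode("ascii"),
        })
```

Because `heapq.merge` is lazy, this works with generators that read from a microphone and camera as data arrives, provided each source yields items in timestamp order.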
The same doc says a single WebSocket session can last up to 120 minutes before the server closes it automatically. The separate Qwen-Omni API doc frames the non-realtime path as multimodal input with streaming responses, which is a useful split: one surface for long-form multimodal understanding, another for live audio-video chat.
That last detail gives the release a cleaner shape than the launch thread alone. Qwen is not shipping one giant omni endpoint. It is shipping a small stack: hosted demos, an offline multimodal API, and a dedicated realtime transport for interactive voice and video sessions.
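For long-running voice agents, the 120-minute session cap mentioned above is the detail most likely to surface in production. A minimal sketch of client-side bookkeeping follows, assuming the client wants to reconnect on its own schedule shortly before the server closes the socket. The class name, the safety margin, and the injectable clock are all illustrative choices.

```python
import time

MAX_SESSION_SECONDS = 120 * 60  # cap stated in the realtime API doc


class SessionBudget:
    """Tracks how long a realtime session has been open (hypothetical helper)."""

    def __init__(self, safety_margin_seconds: float = 60.0, clock=time.monotonic):
        self._clock = clock                    # injectable for testing
        self._margin = safety_margin_seconds
        self._opened_at = clock()

    def remaining(self) -> float:
        """Seconds left before the server-side cap, never negative."""
        elapsed = self._clock() - self._opened_at
        return max(0.0, MAX_SESSION_SECONDS - elapsed)

    def should_rotate(self) -> bool:
        """True once the session is within the safety margin of the cap."""
        return self.remaining() <= self._margin

    def reset(self) -> None:
        """Call after a fresh connection has been opened."""
        self._opened_at = self._clock()
```

A send loop could check `should_rotate()` between turns and open a replacement connection during a quiet moment, which avoids being cut off mid-sentence when the server enforces the limit.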