Alibaba launched Qwen3.5-Omni across Lite, Flash, Plus, and Plus-Realtime variants for native text, image, audio, and video understanding, plus realtime voice controls and script-level captioning. The family targets long multimodal sessions and live interaction, so watch the understanding-focused limits if you need media generation.

You can read the official launch post, jump straight into Qwen Chat, inspect the offline demo on Hugging Face, and compare that with the online demo. The developer angle is split across two Alibaba Cloud docs, one for the multimodal API and one for the realtime WebSocket API. The weirdest product framing is still the launch video genre Qwen chose to lead with: multimodal coding by talking at a camera.
Qwen positioned this as a family release, not a flagship-only drop. The variant list circulating on launch day names Qwen3.5-Omni-Lite, Flash, Plus, and Plus-Realtime, which lines up with the official launch messaging around separate offline and realtime use cases.
The raw envelope is the part engineers will remember. Qwen says Plus can natively handle up to 10 hours of audio or 400 seconds of 720p video, and that the speech stack recognizes 113 languages while speaking 36. Those are system-shape claims as much as benchmark claims, because they hint at where Qwen thinks long-session multimodal interaction is headed.
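If you plan to push files that large, a client-side pre-flight check is cheaper than a failed upload. Here is a minimal sketch assuming the claimed caps of 10 hours of audio and 400 seconds of 720p video; the thresholds come from the launch claims, not the API docs, and `ffprobe` (part of FFmpeg) does the duration reading.

```python
# Pre-flight check against Qwen's claimed Plus-tier media caps.
# The limits below are launch claims, not confirmed API limits;
# verify them against the current docs before enforcing them.
import subprocess

MAX_AUDIO_SECONDS = 10 * 60 * 60  # claimed: up to 10 hours of audio
MAX_VIDEO_SECONDS = 400           # claimed: up to 400 s of 720p video

def media_duration_seconds(path: str) -> float:
    """Read a media file's duration using ffprobe (ships with FFmpeg)."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

def fits_claimed_cap(path: str, kind: str) -> bool:
    """True if the file is under the claimed cap for 'audio' or 'video'."""
    cap = MAX_AUDIO_SECONDS if kind == "audio" else MAX_VIDEO_SECONDS
    return media_duration_seconds(path) <= cap
```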
Qwen made "Audio-Visual Vibe Coding" the hero feature, and the demos are specific enough to matter. The pitch is that a user describes a UI or game idea verbally while pointing the camera, and Qwen3.5-Omni-Plus generates a functional artifact instead of just returning a design description.
A second demo matters because it makes this look less like a one-off canned flow. Qwen shows another spoken, camera-guided build sequence that resolves into a working website, with code and rendered output appearing together.
That is Christmas-come-early marketing for agent builders, but it is also narrower than the name suggests. The launch materials consistently frame the model as understanding text, images, audio, and video natively, then using that understanding for coding, search, or function calls, not as a general media generator.
The strongest realtime demos are not the coding ones. They are the interaction controls.
Qwen shows at least four distinct realtime behaviors:

- multi-turn dialogue that holds context across spoken turns
- intelligent interruption, where the model yields mid-answer the moment the user cuts in
- voice style and emotion control through spoken instructions
- volume control adjusted on request
There is also a separate travel-planning demo, which suggests Qwen wants this read as a general voice agent platform, not just a flashy multimodal assistant.
The benchmark image in the launch thread is dense, but the pattern is clear. Qwen3.5-Omni-Plus posts stronger numbers than Gemini-3.1 Pro on several audio tasks, including Librispeech-other, MIR-1K, VoiceBench, RUL-MuchoMusic, and MMAU, while roughly trading blows on audio-visual understanding depending on the benchmark.
Two caveats stand out from the table and the reposts. The first is the understanding framing already noted: every number here measures comprehension, not media generation. The other is access: public demos are live, but the launch reads like hosted availability through Qwen Chat, Hugging Face Spaces, and Alibaba Cloud endpoints, not an open-weight release.
The developer docs add one concrete implementation detail the social posts barely mention. Qwen-Omni-Realtime uses WebSocket rather than ordinary request-response HTTP, accepts streaming audio plus continuous image frames, and returns text and audio in realtime through region-specific endpoints for Beijing and Singapore, per the realtime API docs.
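To make the transport concrete, here is a minimal sketch of a realtime session. Everything protocol-specific is an assumption: the endpoint URL, the auth header, and the event names are modeled on common realtime-API conventions, so treat them as placeholders and take the real values from the realtime API docs.

```python
# Hypothetical Qwen-Omni-Realtime session over WebSocket.
# URL, header, and event names are illustrative placeholders.
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

REALTIME_URL = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime"  # assumed Beijing endpoint

async def main() -> None:
    headers = {"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}"}
    # On websockets < 14 this keyword is extra_headers instead.
    async with websockets.connect(REALTIME_URL, additional_headers=headers) as ws:
        # Stream audio in ~100 ms chunks (assuming 16 kHz, 16-bit mono PCM).
        with open("question.pcm", "rb") as f:
            while chunk := f.read(3200):
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",  # assumed event name
                    "audio": base64.b64encode(chunk).decode("ascii"),
                }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))

        # Print streamed text deltas until the server signals completion.
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "response.text.delta":
                print(event.get("delta", ""), end="", flush=True)
            elif event.get("type") == "response.done":
                break

asyncio.run(main())
```

Continuous image frames would ride the same socket as additional input events, which is the point of the WebSocket split: both directions stay open, so audio out can overlap with frames coming in.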
The same doc says a single WebSocket session can last up to 120 minutes before the server closes it automatically. The separate Qwen-Omni API doc frames the non-realtime path as multimodal input with streaming responses, which is a useful split: one surface for long-form multimodal understanding, another for live audio-video chat.
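The non-realtime path is easier to sketch because Alibaba Cloud exposes an OpenAI-compatible endpoint for its hosted models. The base URL below is DashScope's real compatible-mode address, but the model id is a guess derived from the family naming, so check the multimodal API doc for the actual identifier.

```python
# Non-realtime Qwen-Omni sketch: multimodal input, streamed text output.
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # placeholder id, not confirmed
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/ui-sketch.png"}},
            {"type": "text",
             "text": "Describe this UI sketch, then write the HTML for it."},
        ],
    }],
    stream=True,  # the docs frame this surface as streaming responses
)

for chunk in stream:
    # Guard against keep-alive chunks with no choices.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```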
That last detail gives the release a cleaner shape than the launch thread alone. Qwen is not shipping one giant omni endpoint. It is shipping a small stack: hosted demos, an offline multimodal API, and a dedicated realtime transport for interactive voice and video sessions.
Qwen released the Qwen3.5-Omni model family, designed for native understanding of text, images, audio, and video: Qwen3.5-Omni-Lite, Qwen3.5-Omni-Flash, Qwen3.5-Omni-Plus, and Qwen3.5-Omni-Plus-Realtime.
Qwen3.5-Omni is here! Scaling up to a native omni-modal AGI. Meet the next generation of Qwen, designed for native text, image, audio, and video understanding, with major advances in both intelligence and real-time interaction. A standout feature: 'Audio-Visual Vibe Coding'.
Demo 2: Audio-Visual Vibe Coding
Here's another demo of Audio-Visual Vibe Coding~
Demo 4: Multi-Turn Dialogue and Intelligent Interruption
Demo 5: Voice Style, Emotion and Volume Control