Alibaba introduced the Qwen3.5-Omni family for native text, image, audio, and video understanding, and demoed audio-visual vibe coding plus realtime voice controls. The model extends multimodal workflows from short clips to long-form inputs and live interaction.

Qwen3.5-Omni is a new family rather than a single endpoint. Alibaba's launch post names Plus, Flash, and Light tiers, while the offline API docs and realtime docs split the product into batch-style multimodal processing and a WebSocket-based live interaction stack. The realtime docs say developers can integrate either directly over WebSocket or through the DashScope SDK in Python or Java, with sessions running up to 120 minutes and region-specific endpoints for Beijing and Singapore.
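For orientation, a minimal Python sketch of the raw-WebSocket path might look like the following. The endpoint URL, query parameter, and event schema here are assumptions modeled on common realtime-API conventions, not confirmed values; check the realtime docs for the actual Beijing or Singapore endpoints.

```python
# Sketch only: connect to the realtime stack over a raw WebSocket.
# ASSUMPTIONS: the URL, model name, and "session.update" event shape are
# illustrative stand-ins, not confirmed fields from the realtime docs.
import asyncio
import json
import os

import websockets  # pip install websockets

# Hypothetical Beijing-region endpoint; the docs list region-specific URLs.
URL = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime?model=qwen3.5-omni-flash"

async def main() -> None:
    headers = {"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}"}
    # additional_headers requires websockets>=14; older versions use extra_headers.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Configure the session once, then read events until the server closes
        # (the docs cap a session at 120 minutes).
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"modalities": ["text", "audio"]},
        }))
        async for raw in ws:
            event = json.loads(raw)
            print(event.get("type"))

asyncio.run(main())
```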
On published evals, Alibaba claims Qwen3.5-Omni-Plus “outperform[s] Gemini-3.1 Pro in audio” and matches it in broader audio-visual understanding (launch post). The benchmark table in the launch post shows a mixed but competitive picture: Qwen leads Gemini 3.1 Pro on DailyOmni, QualcommInteractive, Omni-Cloze, VoiceBench, RUL-MuchoMusic, MMAU, and several vision tasks, while trailing on WorldSense and some text benchmarks.
Alibaba's headline workflow is “Audio-Visual Vibe Coding”: in the demo post, a spoken prompt and camera input produce working code for a site or game, and a second demo shows the same pattern yielding a complete website from live multimodal input. That matters less as a flashy demo than as a signal that Qwen treats camera frames, speech, and code generation as one loop rather than separate model calls.
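If the offline API follows OpenAI-style multimodal content parts (an assumption; the docs' exact schema may differ), a single vibe-coding request could bundle a camera frame and a spoken prompt in one message rather than two calls:

```python
# Sketch of one combined audio+image -> code request.
# ASSUMPTIONS: the OpenAI-style content parts, the file names, and the
# "qwen3.5-omni-flash" tier name are all illustrative placeholders.
import base64

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{b64('whiteboard_frame.jpg')}"}},
        {"type": "input_audio",
         "input_audio": {"data": b64("spoken_prompt.wav"), "format": "wav"}},
        {"type": "text",
         "text": "Generate a working single-file website from the sketch and the spoken brief."},
    ],
}]
```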
The other concrete offline workflow is long-form media parsing. In the captioning demo, Qwen generates dense audio-visual captions, and Alibaba says those outputs can be elevated to “script-level captioning” with timestamps, scene boundaries, and speaker mapping (launch post). A separate travel-planning demo also shows the model grounding on UI, dates, and itinerary details from mixed visual and spoken input rather than a single uploaded image or clip.
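A script-level captioning request would then be mostly prompt design. The sketch below assumes a "video_url" content part; the actual video-input format in the offline docs may differ.

```python
# Sketch of a "script-level captioning" request over a long-form video.
# ASSUMPTIONS: the "video_url" content part and the example URL are
# illustrative; consult the offline API docs for supported video inputs.
messages = [{
    "role": "user",
    "content": [
        {"type": "video_url", "video_url": {"url": "https://example.com/episode.mp4"}},
        {"type": "text", "text": (
            "Produce script-level captions: [hh:mm:ss] timestamps, "
            "scene-boundary markers, and a speaker label on every line of dialogue."
        )},
    ],
}]
```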
The realtime demos are more specific than the announcement language. In the voice-control demo, the model's speech is adjusted for style, emotion, and volume mid-conversation; in the interruption demo, Alibaba shows “Multi-Turn Dialogue” and “Intelligent Interruption,” matching the launch claim that Qwen can do “smart turn-taking” and ignore noise (launch post).
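Mid-conversation control presumably arrives as a session-level event on the live socket. The sketch below is purely hypothetical: the event type and every field name are stand-ins for the demoed controls, not documented parameters.

```python
# Sketch of adjusting speech style mid-conversation over the live socket.
# ASSUMPTIONS: "session.update" and the "voice" / "emotion" / "volume"
# fields are hypothetical illustrations of the demoed controls.
import json

style_update = json.dumps({
    "type": "session.update",
    "session": {
        "voice": "warm_female",   # hypothetical voice id
        "emotion": "excited",     # demoed control, field name assumed
        "volume": 0.8,            # demoed control, field name assumed
    },
})
# await ws.send(style_update)  # reusing the socket from the earlier sketch
```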
For engineers, the key point is that Alibaba has published both realtime docs and offline API docs at launch. The offline docs describe Python and Node.js quickstarts and note that responses are streaming-only, while the realtime docs add live audio/video ingestion, function calling, and web search. The feature boundary is still worth stating clearly: as one early reaction put it, “omni” does not mean a general image generator, so the release is best understood as a multimodal understanding-and-interaction stack with speech output, not an everything model.
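As a concrete starting point, here is a hedged Python quickstart against DashScope's OpenAI-compatible gateway with streaming on. The model identifier is a placeholder for whichever tier the offline docs actually name.

```python
# Streaming-only offline call via DashScope's OpenAI-compatible endpoint.
# ASSUMPTION: the model id "qwen3.5-omni-flash" is a placeholder tier name.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",  # placeholder tier name
    messages=[{"role": "user", "content": "Summarize this release in one line."}],
    stream=True,  # the offline docs note responses are streaming-only
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```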
From the announcement post: "🚀 Qwen3.5-Omni is here! Scaling up to a native omni-modal AGI. Meet the next generation of Qwen, designed for native text, image, audio, and video understanding, with major advances in both intelligence and real-time interaction. A standout feature: 'Audio-Visual Vibe Coding'."

Linked demos: Demo 1 (Audio-Visual Captioning), Demo 2 (Audio-Visual Vibe Coding), Demo 5 (Voice Style, Emotion and Volume Control).