Alibaba introduced the Qwen3.5-Omni family for native text, image, audio, and video understanding, and demoed audio-visual vibe coding plus realtime voice controls. The model extends multimodal workflows from short clips to long-form inputs and live interaction.

Qwen3.5-Omni is a new family rather than a single endpoint. Alibaba's launch post names Plus, Flash, and Light tiers, while the offline API docs and realtime docs split the product into batch-style multimodal processing and a WebSocket-based live interaction stack. The realtime docs say developers can integrate either directly over WebSocket or through the DashScope SDK in Python or Java, with sessions running up to 120 minutes and region-specific endpoints for Beijing and Singapore.
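For orientation, a minimal Python sketch of the raw-WebSocket path might look like the following. The endpoint URL, query parameter, and event schema here are assumptions modeled on common realtime-API conventions, not confirmed values; check the realtime docs for the actual Beijing or Singapore endpoints.

```python
# Sketch only: connect to the realtime stack over a raw WebSocket.
# ASSUMPTIONS: the URL, model name, and "session.update" event shape are
# illustrative stand-ins, not confirmed fields from the realtime docs.
import asyncio
import json
import os

import websockets  # pip install websockets

# Hypothetical Beijing-region endpoint; the docs list region-specific URLs.
URL = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime?model=qwen3.5-omni-flash"

async def main() -> None:
    headers = {"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}"}
    # additional_headers requires websockets>=14; older versions use extra_headers.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Configure the session once, then read events until the server closes
        # (the docs cap a session at 120 minutes).
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"modalities": ["text", "audio"]},
        }))
        async for raw in ws:
            event = json.loads(raw)
            print(event.get("type"))

asyncio.run(main())
```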
On published evals, Alibaba claims Qwen3.5-Omni-Plus “outperform[s] Gemini-3.1 Pro in audio” and matches it in broader audio-visual understanding (launch post). The benchmark table in the launch post shows a mixed but competitive picture: Qwen leads Gemini 3.1 Pro on DailyOmni, QualcommInteractive, Omni-Cloze, VoiceBench, RUL-MuchoMusic, MMAU, and several vision tasks, while trailing on WorldSense and some text benchmarks.
Alibaba's headline workflow is “Audio-Visual Vibe Coding”: in the demo post, a spoken prompt and camera input produce working code for a site or game, and a second demo shows the same pattern yielding a complete website from live multimodal input. That matters less as a flashy demo than as a signal that Qwen treats camera frames, speech, and code generation as one loop rather than separate model calls.
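If the offline API follows OpenAI-style multimodal content parts (an assumption; the docs' exact schema may differ), a single vibe-coding request could bundle a camera frame and a spoken prompt in one message rather than two calls:

```python
# Sketch of one combined audio+image -> code request.
# ASSUMPTIONS: the OpenAI-style content parts, the file names, and the
# "qwen3.5-omni-flash" tier name are all illustrative placeholders.
import base64

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{b64('whiteboard_frame.jpg')}"}},
        {"type": "input_audio",
         "input_audio": {"data": b64("spoken_prompt.wav"), "format": "wav"}},
        {"type": "text",
         "text": "Generate a working single-file website from the sketch and the spoken brief."},
    ],
}]
```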
The other concrete offline workflow is long-form media parsing. In the captioning demo, Qwen generates dense audio-visual captions, and Alibaba says those outputs can be elevated to “script-level captioning” with timestamps, scene boundaries, and speaker mapping (launch post). A separate travel-planning demo also shows the model grounding on UI, dates, and itinerary details from mixed visual and spoken input rather than a single uploaded image or clip.
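A script-level captioning request would then be mostly prompt design. The sketch below assumes a "video_url" content part; the actual video-input format in the offline docs may differ.

```python
# Sketch of a "script-level captioning" request over a long-form video.
# ASSUMPTIONS: the "video_url" content part and the example URL are
# illustrative; consult the offline API docs for supported video inputs.
messages = [{
    "role": "user",
    "content": [
        {"type": "video_url", "video_url": {"url": "https://example.com/episode.mp4"}},
        {"type": "text", "text": (
            "Produce script-level captions: [hh:mm:ss] timestamps, "
            "scene-boundary markers, and a speaker label on every line of dialogue."
        )},
    ],
}]
```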
The realtime demos are more specific than the announcement language. In the voice-control demo, the model's speech is adjusted for style, emotion, and volume mid-conversation; in the interruption demo, Alibaba shows “Multi-Turn Dialogue” and “Intelligent Interruption,” matching the launch claim that Qwen can do “smart turn-taking” and ignore noise (launch post).
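Mid-conversation control presumably arrives as a session-level event on the live socket. The sketch below is purely hypothetical: the event type and every field name are stand-ins for the demoed controls, not documented parameters.

```python
# Sketch of adjusting speech style mid-conversation over the live socket.
# ASSUMPTIONS: "session.update" and the "voice" / "emotion" / "volume"
# fields are hypothetical illustrations of the demoed controls.
import json

style_update = json.dumps({
    "type": "session.update",
    "session": {
        "voice": "warm_female",   # hypothetical voice id
        "emotion": "excited",     # demoed control, field name assumed
        "volume": 0.8,            # demoed control, field name assumed
    },
})
# await ws.send(style_update)  # reusing the socket from the earlier sketch
```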
For engineers, the key point is that Alibaba has published both realtime docs and offline API docs at launch. The offline docs describe Python and Node.js quickstarts and note that responses are streaming-only, while the realtime docs add live audio/video ingestion, function calling, and web search. The feature boundary is still worth stating clearly: as one early reaction put it, “omni” does not mean a general image generator, so the release is best understood as a multimodal understanding-and-interaction stack with speech output, not an everything model.
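As a concrete starting point, here is a hedged Python quickstart against DashScope's OpenAI-compatible gateway with streaming on. The model identifier is a placeholder for whichever tier the offline docs actually name.

```python
# Streaming-only offline call via DashScope's OpenAI-compatible endpoint.
# ASSUMPTION: the model id "qwen3.5-omni-flash" is a placeholder tier name.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",  # placeholder tier name
    messages=[{"role": "user", "content": "Summarize this release in one line."}],
    stream=True,  # the offline docs note responses are streaming-only
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```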
From the announcement post: "🚀 Qwen3.5-Omni is here! Scaling up to a native omni-modal AGI. Meet the next generation of Qwen, designed for native text, image, audio, and video understanding, with major advances in both intelligence and real-time interaction. A standout feature: 'Audio-Visual Vibe Coding'."

Linked demos: Demo 1 (Audio-Visual Captioning), Demo 2 (Audio-Visual Vibe Coding), Demo 5 (Voice Style, Emotion and Volume Control).