Skip to content
AI Primer
MODEL19 stories

Qwen

Stories, products, and related signals connected to this tag in Explore.

NEWS17th May
Practitioners benchmark Qwen3.6 and Gemma 4 at 40-65 tok/s on M3 Ultra, iPhone 17 Pro, and 4x A4000

New reports show Qwen3.6 and Gemma 4 running locally across Apple and Nvidia setups, with wide variance tied to context length, runtime choice, and MTP tuning. This matters because the latest open models are reaching usable agent speeds on consumer hardware, but prefill and long-context performance still cap throughput.

RELEASE15th May
Unsloth updates Qwen3.5 MTP GGUFs with draft-mtp flags for 1.8x speed

Unsloth said its updated Qwen3.5 MTP GGUFs now run about 1.8x faster after llama.cpp added spec-draft-p-min 0.75 and renamed the mode to draft-mtp. The update also raises draft-token settings and expands the small-model MTP set for local runners.

NEWS12th May
Perplexity benchmarks Qwen3 235B on GB200 NVL72: NVLS latency drops from 586 µs to 313 µs

Perplexity published serving results for post-trained Qwen3 235B on NVIDIA GB200 NVL72 and argues Blackwell materially outperforms Hopper for large MoE inference. The deltas show up in NVLS all-reduce latency, MoE prefill combine time, and high-speed decode throughput.

NEWS1w ago
Local users report DeepSeek V4 Flash, Qwen 3.6, and Gemma 4 at 40-200 tok/s on Macs and 3090s

Developers posted new local-model measurements for DS4, Qwen 3.6, and Gemma 4: about 40 tok/s on an M3 Ultra, 70+ tok/s on MacBooks with MPS, and 120-200 tok/s for Qwen3.6-27B on a single RTX 3090. The numbers suggest coding-capable local runs are moving from demos toward regular use.

RELEASE1w ago
DFlash adds Qwen3-8B speculator with 82.2% first-token acceptance

Posts said Qwen3-8B now has a DFlash speculator with 82.2% first-token acceptance and 3.74 accepted tokens per step, alongside broader DFlash claims of over 6x lossless acceleration. It matters because the release turns a decoding paper into a concrete speculative-inference artifact engineers can test against existing Qwen stacks.

RELEASE2w ago
Qwen-Scope releases SAE toolkit for Qwen3.5-27B steering

Alibaba’s Qwen team released Qwen-Scope, an open sparse-autoencoder suite for Qwen3.5-27B that can steer outputs, surface repetition features, and compare benchmark feature overlap. The toolkit turns interpretability artifacts into debugging, data-generation, and evaluation workflows.

RELEASE2w ago
FlashQLA releases TileLang linear-attention kernels with 2–3x forward speedups

Alibaba Qwen introduced FlashQLA, a TileLang-based linear-attention kernel stack that reports 2–3x faster forward passes and 2x faster backward passes. The release gives edge and long-context deployments a new optimization lever below the model layer itself.

NEWS3w ago
Qwen3.6 community ships MLX and 3-bit quants with 40-56 tok/s local agent runs

Builders published new MLX and 3-bit Qwen3.6 quants and shared reproducible local benchmarks from M3 Ultra, RTX 5070, and Radeon AI Pro setups. That gives local-agent teams concrete deployment options beyond launch-day claims, though memory budgets and long-context tool use still limit larger workflows.

RELEASE3w ago
Qwen-Image-2.0-Pro launches at #9 on Arena with multilingual text rendering

Alibaba launched Qwen-Image-2.0-Pro on ModelScope and API with better prompt adherence, multilingual typography, and steadier style quality. The model is aimed at text-heavy jobs like UI mockups and posters, so test it for layout-heavy generation.

RELEASE3w ago
Qwen3.6-27B releases with 77.2 SWE-Bench Verified and Apache 2.0

Alibaba released Qwen3.6-27B, a dense open model with multimodal input and thinking or non-thinking modes that beats Qwen3.5-397B-A17B across major coding benchmarks. Day-one support across vLLM, SGLang, Ollama, llama.cpp, GGUF, and MLX makes it ready for local and hosted coding agents.

RELEASE4w ago
Qwen launches Qwen3.6-Max-Preview on Qwen Chat with AA Index 52

Qwen put Qwen3.6-Max-Preview live on Qwen Chat as an early flagship preview with stronger agentic coding and world-knowledge claims. Early testers report strong first-pass results, but the Max line remains closed rather than open-sourced.

NEWS4w ago
Qwen3.6-35B-A3B benchmarks 40 tok/s on M3 Ultra with Strix Halo follow-ups

Fresh local reports put Qwen3.6-35B-A3B around 40 tok/s on M3 Ultra, extended testing to Strix Halo, and wired it into OpenClaw and Pi-style harnesses. The update matters because Qwen3.6 is moving from quant benchmarks into real local coding-agent loops with clearer hardware limits.

WORKFLOW4w ago
Unsloth benchmarks Qwen3.6-35B-A3B GGUF quants at 20-40 tok/s on local rigs

Unsloth published GGUF quant benchmarks for Qwen3.6-35B-A3B while practitioners shared local setup guides and long-context agent runs on Apple silicon and high-RAM desktops. The sparse 35B model is becoming a credible local coding-agent option, but speed and reasoning quality still vary by quant and offload strategy.

RELEASE4w ago
Qwen3.6-35B-A3B releases Apache 2.0 sparse MoE with 3B active params

Alibaba open-sourced Qwen3.6-35B-A3B, a 35B multimodal sparse MoE with only 3B active parameters under Apache 2.0. Same-day support from vLLM, Ollama, SGLang, and GGUF builders makes it immediately usable for local and production coding workloads.

RELEASE1mo ago
Qwen Code updates v0.14.2 with Channels, Cron Jobs, and Qwen3.6-Plus

Qwen Code added phone-based control via Telegram, DingTalk, and WeChat, scheduled agent loops, per-subagent model selection, and a planning mode before execution. The release also centers Qwen3.6-Plus, which Alibaba says offers 1M context and 1,000 free daily requests, while Vals ranked the model #17 overall and #11 multimodal.

NEWS1mo ago
OpenRouter says Qwen3.6-Plus hits 1.4T tokens in a day

OpenRouter said Qwen3.6-Plus became its first model to exceed about 1.4 trillion tokens in a day, and Qwen said the model also moved to No. 1 on the service. The milestone adds a concrete deployment signal beyond benchmark scores and preview availability, so track usage data alongside evals.

RELEASE1mo ago
Qwen3.6-Plus launches with 1M context and Code Arena #8 ranking

Alibaba launched Qwen3.6-Plus with a 1M default context window, stronger coding and multimodal performance, and rollout across chat, API, and routing partners. Benchmarks and partner availability make it a new high-end option for agentic coding and web tasks.

RELEASE1mo ago
Qwen releases Qwen3.5-Omni with 10-hour audio and 400s video support

Alibaba launched Qwen3.5-Omni across Lite, Flash, Plus, and Plus-Realtime variants for native text, image, audio, and video understanding, plus realtime voice controls and script-level captioning. The family targets long multimodal sessions and live interaction, so watch the understanding-focused limits if you need media generation.

NEWS1mo ago
ATLAS benchmarks Qwen3-14B at 74.6% LiveCodeBench on one RTX 5060 Ti

The ATLAS harness says a frozen Qwen3-14B Q4 model on one RTX 5060 Ti reached 74.6% pass@1-v(k=3) on LiveCodeBench v5 through multi-pass repair and selection. The result shifts comparison toward harness design, though HN commenters note it is not a one-shot head-to-head with hosted frontier models.

AI PrimerAI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.