Qwen3.6 community ships MLX and 3-bit quants with 40-56 tok/s local agent runs
Builders published new MLX and 3-bit Qwen3.6 quants and shared reproducible local benchmarks from M3 Ultra, RTX 5070, and Radeon AI Pro setups. That gives local-agent teams concrete deployment options beyond launch-day claims, though memory budgets and long-context tool use still limit larger workflows.

TL;DR
- ClementDelangue's repost of the MLX quants and JLeonsarmiento's 3-bit mixed quant post show that the fastest community packaging work went to Apple Silicon first, with MLX builds for Qwen3.6 and a Mac-focused 3-bit/5-bit mixed quant already on Hugging Face.
- On Nvidia, badlogicgames' repost of aphronio's RTX 5070 run reported 56 tok/s decode and 858 tok/s prefill at 64K context, while Hacker News commenters separately described about 40 tok/s on an M3 Ultra Mac Studio via local stacks.
- The packaging story is not just about fitting the model in memory: the main HN thread emphasized tool-use consistency and smoother agent sessions, while the ClaudeCode Reddit thread and its comments kept circling back to long-context state loss and dropped tool calls as the harder bottleneck.
- AMD users also started sharing reproducible tuning data. exact_constraint's LocalLLaMA benchmark dump tested llama.cpp's --spec-type ngram-mod on Qwen3.6 27B and found a useful speedup on repeat codebase work, but with jittery prompt and generation stability.
You can read Qwen's official launch post, skim the big HN discussion, and check out both the Mac 3-bit mixed quant and the LocalLLaMA ngram-mod benchmark thread. There is already enough post-launch community evidence to talk about deployment paths by hardware class, not just launch-day claims.
MLX and 3-bit quants
Qwen3.6-27B-3bit-mlx · Hugging Face: 3 & 5 mixed quant for RAM poor Mac users.
0 comments
The first real wave after Qwen's official announcement was packaging, not theory. One branch targeted Apple Silicon with MLX quants, while another squeezed the 27B model into a mixed 3-bit format for lower-RAM Macs.
JLeonsarmiento said the mixed quant keeps 5-bit precision for embeddings and prediction layers, claimed it was roughly twice as fast as the other available 3-bit Mac option, and pointed users to LM Studio's preserve_thinking template flag in the same post. That is the useful bit here: the community is already shipping opinionated local presets, not just raw weights.
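If you want to poke at one of these builds directly, the basic shape with the mlx-lm Python package is a short load-and-generate script. This is a minimal sketch, not a vetted recipe: the repo id is a placeholder for whichever community quant you pick, and generate() options shift a bit between mlx-lm versions.

```python
# Minimal sketch: run an MLX quant locally with the mlx-lm package.
# The repo id is a placeholder, not the actual community upload's id.
from mlx_lm import load, generate

model, tokenizer = load("someuser/Qwen3.6-27B-3bit-mlx")  # placeholder repo id

prompt = "Explain what this function does:\n\ndef f(xs): return [x * x for x in xs]"
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```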
Benchmarks on M3 Ultra, RTX 5070, and Radeon AI Pro
Qwen3.6-35B-A3B: Agentic coding power, now open to all
1.3k upvotes · 532 comments
The cross-hardware picture is unusually concrete for a model this new.
- RTX 5070: aphronio's run, shared via badlogicgames' repost, reported 56 tok/s decode and 858 tok/s prefill at 64K context.
- M3 Ultra Mac Studio: a top HN comment cited about 40 tok/s and called it "by far my smoothest agentic session using a local model."
- Laptop-class local run: another HN comment said the Unsloth 20.9 GB GGUF was already workable in LM Studio on a laptop.
- Radeon AI PRO R9700: exact_constraint's LocalLLaMA benchmark dump measured a 31.26 tok/s tg128 baseline in llama-bench before switching to the speculative setup.
That spread matters because it turns Qwen3.6 from a single benchmark headline into three different local deployment lanes: Macs running MLX or GGUF, upper-midrange consumer Nvidia cards running long context, and AMD workstations with llama.cpp tuning.
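For the GGUF lane, the rough shape of a local run is the llama-cpp-python sketch below, assuming you have already downloaded a Qwen3.6 GGUF; the file name is a placeholder, and the context size and GPU offload values are illustrative rather than tuned recommendations.

```python
# Rough sketch of the GGUF lane with llama-cpp-python (the same llama.cpp
# backend runs on Mac, Nvidia, and AMD builds). Model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3.6-27b-q4_k_m.gguf",  # placeholder filename
    n_ctx=65536,      # long-context runs like the 64K RTX 5070 numbers need a large window
    n_gpu_layers=-1,  # offload every layer if it fits; lower this on smaller cards
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the bug in this stack trace: ..."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```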
Agent loops and reliability
Qwen3.6-35B-A3B: Agentic coding power, now open to all
1.3k upvotes · 532 comments
The future is local
19 comments
The strongest community praise was not about raw reasoning. It was about agent behavior.
According to the main HN thread, users highlighted three things:
- tool-use consistency held up better than expected,
- the model broke tasks into smaller actionable steps,
- it asked clarifying questions instead of bulldozing ahead.
That still came with a familiar caveat. In the ClaudeCode Reddit thread, one commenter argued the real gap for local models is reliable tool-call execution across long contexts, saying they handle single-file work but start losing state or dropping tool calls after a few turns in a real agent loop.
The same thread put rough memory expectations on the table: the original post said a 24 GB M4 Pro could run an abliterated quant, but described 32 GB to 48 GB as the more comfortable range, especially for broader workflows.
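To make the long-context tool-call concern concrete, here is a minimal agent-loop sketch against an OpenAI-compatible local server such as the one LM Studio exposes on port 1234. The model name and the single read_file tool are stand-ins for illustration, and it assumes the server and loaded model support OpenAI-style tool calls.

```python
# Minimal local agent-loop sketch against an OpenAI-compatible local server.
# The model name and the read_file tool are hypothetical placeholders; the point
# is the multi-turn loop where dropped or malformed tool calls show up in practice.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool for illustration
        "description": "Read a file from the working directory",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

messages = [{"role": "user", "content": "Find and fix the off-by-one bug in utils.py"}]

for _ in range(8):  # cap the loop; long sessions are where state loss tends to appear
    resp = client.chat.completions.create(model="local-model", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)
        break
    messages.append(msg)  # keep the assistant turn so the model sees its own tool calls
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = Path(args["path"]).read_text()  # stand-in for a real tool implementation
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```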
Dense versus MoE expectations
Qwen3.6-35B-A3B: Agentic coding power, now open to all
1.3k upvotes · 532 comments
The HN thread also exposed a split that launch posts rarely dwell on. Qwen3.6-35B-A3B looked efficient enough that several commenters treated it as a serious local coding model, but at least one high-ranking reply still said they wished Qwen had shipped a 27B dense model instead.
That preference was not framed as nostalgia. It was about predictability. In the dense-versus-MoE subthread, the concern was that medium dense models remain easier to trust for local runs even when newer MoE models post better capability and efficiency numbers.
Ngram-mod on repeat codebases
Brief Ngram-Mod Test Results - R9700/Qwen3.6 27B
0 comments
The most specific post-launch tuning result came from exact_constraint's Radeon run in llama.cpp. They tested --spec-type ngram-mod during an OpenCode bug-chasing session and said it gave a nice speed increase when working on the same codebase.
The raw numbers were messy enough to be useful. exact_constraint's LocalLLaMA benchmark dump reported mean prompt processing of 549.60 tok/s, median generation of 28.20 tok/s, a 45.34 tok/s P95 generation tail, and labeled both prompt and generation stability as jittery. That is a better picture than a single headline tok/s number, because it shows the trade: faster reuse on recurring code context, but uneven latency from turn to turn.
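If you want to report your own runs in the same shape, the median/P95 summary is easy to compute from per-turn throughput samples; the values below are placeholders for illustration, not exact_constraint's measurements.

```python
# Sketch: summarize per-turn generation speeds (tok/s) into median / P95 / spread,
# the same shape of numbers quoted above. Sample values are illustrative only.
import statistics

gen_tok_per_s = [27.9, 28.4, 31.0, 26.5, 44.8, 28.2, 29.1, 45.5, 27.7, 28.6]  # placeholder samples

median = statistics.median(gen_tok_per_s)
p95 = statistics.quantiles(gen_tok_per_s, n=20)[-1]  # last cut point = 95th percentile
spread = statistics.pstdev(gen_tok_per_s)

print(f"median {median:.2f} tok/s, P95 {p95:.2f} tok/s, stdev {spread:.2f}")
```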