
Qwen3.6 community ships MLX and 3-bit quants with 40-56 tok/s local agent runs

Builders published new MLX and 3-bit Qwen3.6 quants and shared reproducible local benchmarks from M3 Ultra, RTX 5070, and Radeon AI Pro setups. That gives local-agent teams concrete deployment options beyond launch-day claims, though memory budgets and long-context tool use still limit larger workflows.


TL;DR

You can read Qwen's official launch post, skim the big HN discussion, and check out both the Mac 3-bit mixed quant and the LocalLLaMA ngram-mod benchmark thread. There is already enough post-launch community evidence to talk about deployment paths by hardware class, not just launch-day claims.

MLX and 3-bit quants

r/LocalLLaMA

Qwen3.6-27B-3bit-mlx · Hugging Face: 3 & 5 mixed quant for RAM poor Mac users.

0 comments

The first real wave after Qwen's official announcement was packaging, not theory. One branch targeted Apple Silicon with MLX quants, while another squeezed the 27B model into a mixed 3-bit format for lower-RAM Macs.

JLeonsarmiento said the mixed quant keeps 5-bit precision for embeddings and prediction layers, claimed it was roughly twice as fast as the other available 3-bit Mac option, and pointed users to LM Studio's preserve_thinking template flag in the same post. That is the useful bit here: the community is already shipping opinionated local presets, not just raw weights.
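
For Mac users following that branch, the load path is mlx-lm's standard one. A minimal sketch, with the Hugging Face repo id assumed from the post's naming rather than confirmed:

```python
# Minimal sketch: running a community MLX quant with mlx-lm
# (pip install mlx-lm). The repo id below is assumed from the
# thread's naming, not a confirmed path.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.6-27B-3bit-mlx")
text = generate(
    model,
    tokenizer,
    prompt="Write a Python function that merges two sorted lists.",
    max_tokens=256,
)
print(text)
```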

Benchmarks on M3, RTX 5070, and Radeon AI Pro

Hacker News

Qwen3.6-35B-A3B: Agentic coding power, now open to all

1.3k upvotes · 532 comments

The cross-hardware picture is unusually concrete for a model this new: the community benchmark posts span M3 Ultra Macs, RTX 5070 desktops, and Radeon AI Pro workstations, with local agent runs clustering around 40-56 tok/s.

That spread matters because it turns Qwen3.6 from a single benchmark headline into three different local deployment lanes: Macs running MLX or GGUF, upper-midrange consumer Nvidia cards running long context, and AMD workstations with llama.cpp tuning.
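
If you want to place your own hardware in one of those lanes, a rough throughput probe against any local OpenAI-compatible server (llama-server, LM Studio, and similar) is enough. A sketch, with the endpoint and model name as placeholders for whatever your stack exposes:

```python
# Rough tok/s probe against a local OpenAI-compatible endpoint.
# Base URL and model name are assumptions; substitute your own.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="qwen3.6-27b",  # hypothetical: use the name your server reports
    messages=[{"role": "user", "content": "Explain binary search briefly."}],
    max_tokens=256,
)
elapsed = time.perf_counter() - start
toks = resp.usage.completion_tokens
print(f"{toks} tokens in {elapsed:.1f}s -> {toks / elapsed:.1f} tok/s")
# Note: elapsed includes prompt processing, so this understates pure
# generation speed on long prompts.
```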

Agent loops and reliability

Hacker News

Qwen3.6-35B-A3B: Agentic coding power, now open to all

1.3k upvotes · 532 comments

r/ClaudeCode

The future is local

19 comments

The strongest community praise was not for raw reasoning. It was for agent behavior.

According to the main HN thread, users highlighted three things:

  1. tool-use consistency held up better than expected,
  2. the model broke tasks into smaller actionable steps,
  3. it asked clarifying questions instead of bulldozing ahead.

That still came with a familiar caveat. In the ClaudeCode Reddit thread, one commenter argued the real gap for local models is reliable tool-call execution across long contexts, saying they handle single-file work but start losing state or dropping tool calls after a few turns in a real agent loop.
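
That failure mode is easy to probe for yourself. A minimal harness sketch, assuming a local OpenAI-compatible server and a single stub tool; it just counts turns where the model answers in prose instead of calling the tool, which is the "dropping tool calls" symptom the commenter described:

```python
# Minimal sketch of a dropped-tool-call probe. Assumes a local
# OpenAI-compatible server; model name and tool are illustrative stubs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

messages = [{"role": "user",
             "content": "Read main.py, then utils.py, one file per turn."}]
dropped = 0
for turn in range(6):  # enough turns to see state drift
    msg = client.chat.completions.create(
        model="qwen3.6-27b", messages=messages, tools=TOOLS,
    ).choices[0].message
    if not msg.tool_calls:
        dropped += 1  # answered in prose where a tool call was expected
        messages.append({"role": "user", "content": "Use read_file."})
        continue
    messages.append(msg)
    messages.append({
        "role": "tool",
        "tool_call_id": msg.tool_calls[0].id,
        "content": "# stub file contents",
    })
print(f"turns without a tool call: {dropped}/6")
```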

The same thread put rough memory expectations on the table: the original post said a 24 GB M4 Pro could run an abliterated quant, but described 32 GB to 48 GB as the more comfortable range, especially for broader workflows.
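
The back-of-envelope math lines up with those figures, if you grant a few assumed architecture numbers (effective bits per weight, layer count, KV heads, and head dim below are guesses for illustration, not published specs):

```python
# Rough unified-memory estimate for a 27B model at a 3 & 5 mixed quant.
# bpw, layer count, KV heads, and head dim are illustrative assumptions.
params, bpw = 27e9, 3.5
weights_gb = params * bpw / 8 / 1e9          # ~11.8 GB of weights

def kv_gb(ctx, layers=48, kv_heads=8, head_dim=128):
    # fp16 K and V caches: 2 tensors * 2 bytes per element
    return 2 * layers * kv_heads * head_dim * 2 * ctx / 1e9

for ctx in (8_192, 32_768, 131_072):
    total = weights_gb + kv_gb(ctx)
    print(f"{ctx:>7} ctx: ~{total:.1f} GB "
          f"(weights {weights_gb:.1f} + KV {kv_gb(ctx):.1f})")
```

On those assumptions, the weights alone fit in 24 GB, but long agent contexts push the total into the 30-40 GB range, which is roughly where the thread's 32 GB to 48 GB comfort band sits.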

Dense versus MoE expectations

Hacker News

Qwen3.6-35B-A3B: Agentic coding power, now open to all

1.3k upvotes · 532 comments

The HN thread also exposed a split that launch posts rarely dwell on. Qwen3.6-35B-A3B looked efficient enough that several commenters treated it as a serious local coding model, but at least one highly upvoted reply still said they wished Qwen had shipped a 27B dense model instead.

That preference was not framed as nostalgia. It was about predictability. In the dense-versus-MoE subthread, the concern was that medium dense models remain easier to trust for local runs even when newer MoE models post better capability and efficiency numbers.
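
The arithmetic behind that split is simple enough to show: a 35B-A3B MoE activates only about 3B parameters per token but still has to keep all 35B resident, while a dense 27B activates everything it stores. The quantization level below is illustrative:

```python
# Memory footprint vs per-token compute for the MoE and the wished-for
# dense layout, at an illustrative 4-bit quant. Parameter counts are the
# nominal figures from the model names.
BITS = 4
models = [
    ("Qwen3.6-35B-A3B (MoE)", 35e9, 3e9),
    ("27B dense (hypothetical)", 27e9, 27e9),
]
for name, total, active in models:
    mem_gb = total * BITS / 8 / 1e9
    print(f"{name}: ~{mem_gb:.0f} GB resident, "
          f"~{active / 1e9:.0f}B params active per token")
```

The MoE wins clearly on per-token compute; the subthread's point was that the dense model spends its memory on capability that behaves the same way every turn, which is easier to trust in a local loop.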

Ngram-mod on repeat codebases

r/LocalLLaMA

Brief Ngram-Mod Test Results - R9700/Qwen3.6 27B

0 comments

The most specific post-launch tuning result came from exact_constraint's Radeon run in llama.cpp. They tested --spec-type ngram-mod during an OpenCode bug-chasing session and said it gave a nice speed increase when working on the same codebase.

The raw numbers were messy enough to be useful. exact_constraint's LocalLLaMA benchmark dump reported mean prompt processing of 549.60 tok/s, median generation of 28.20 tok/s, a P95 generation tail of 45.34 tok/s, and labeled both prompt and generation stability as jittery. That is a better picture than a single headline tok/s number, because it shows the trade: faster reuse on recurring code context, but uneven latency from turn to turn.
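
The mechanism behind that same-codebase speedup is worth a sketch. N-gram drafting proposes speculative tokens by matching the tail of the context against earlier text, so text the model has already seen in the session is cheap to regenerate. The toy below shows the lookup idea only, not llama.cpp's actual implementation:

```python
# Toy illustration of ngram-based drafting, not llama.cpp's implementation.
def ngram_draft(tokens: list[int], n: int = 3, k: int = 8) -> list[int]:
    """Propose up to k draft tokens by matching the trailing n-gram
    against earlier context; the model then verifies them in one batch."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # scan backwards for the most recent earlier occurrence of the tail
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]  # what followed the match last time
    return []

history = [5, 9, 2, 7, 1, 3, 5, 9, 2]   # context ends with the n-gram (5, 9, 2)
print(ngram_draft(history, k=3))        # -> [7, 1, 3], seen after it before
```

That also explains the jitter: on turns where the draft matches (re-reading the same files), many tokens are accepted at once and throughput spikes toward the P95 tail; on novel text, it falls back to the plain decode rate.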