Skip to content
AI Primer
update

Qwen3.6-35B-A3B benchmarks 40 tok/s on M3 Ultra with Strix Halo follow-ups

Fresh local reports put Qwen3.6-35B-A3B around 40 tok/s on M3 Ultra, extended testing to Strix Halo, and wired it into OpenClaw and Pi-style harnesses. The update matters because Qwen3.6 is moving from quant benchmarks into real local coding-agent loops with clearer hardware limits.

4 min read
Qwen3.6-35B-A3B benchmarks 40 tok/s on M3 Ultra with Strix Halo follow-ups
Qwen3.6-35B-A3B benchmarks 40 tok/s on M3 Ultra with Strix Halo follow-ups

TL;DR

  • The main Hacker News thread turned into a practical benchmark log for Qwen3.6-35B-A3B, with one early hands-on report clocking about 40 tok/s on an M3 Ultra Mac Studio and saying tool use stayed stable past 100k tokens.
  • Fresh follow-ups in the latest HN delta pushed the conversation from first-run excitement into hardware fit, with Strix Halo tests surfacing alongside a more skeptical note that 16GB is the extreme lower end for useful coding runs.
  • Local harness work showed up fast: onusoz's OpenClaw post describes channel-level model switching and memory-aware loading, while badlogicgames' repost of a Pi plus llama.cpp setup points to the same model getting wired into agent shells rather than benchmark scripts.
  • The performance story is not uniform, because a LocalLLM first-impressions post reported just 11.3 tok/s with tool use on a MacBook Pro M5 64GB, while a LocalLLaMA user running a 40 GB VRAM setup said a Q4 XL quant felt like the first genuinely usable local coding model they had tried.

You can read the official Qwen release post, skim the main HN thread, and compare that against community reports from LocalLLaMA and localLLM. There is also an OpenClaw model-switching post and a Pi plus llama.cpp setup repost, which is where this stops looking like a launch chart and starts looking like a local agent stack.

M3 Ultra and Strix Halo

Y
Hacker News

Qwen3.6-35B-A3B: Agentic coding power, now open to all

1.3k upvotes · 531 comments

Y
Hacker News

Fresh discussion on Qwen3.6-35B-A3B: Agentic coding power, now open to all

1.3k upvotes · 531 comments

The strongest recurring number is still the M3 Ultra report. In the HN discussion highlights, qazplm17 said Qwen3.6 ran at about 40 tok/s on a Mac Studio and kept tool use coherent even after 100k tokens.

The newer signal is where else people are trying it. The latest HN delta says Strix Halo results were coming in line with expectations, which shifts the conversation from one standout Apple Silicon run to backend and hardware tuning across platforms.

That range matters because the thread is not describing a single clean envelope. The MacBook Pro M5 post said tool use felt sluggish at 11.3 tok/s, while the LocalLLaMA report described 50 to 60 tok/s with filled context on a 40 GB VRAM setup using Open Code and extra MCPs.

Harnesses and model switching

The most useful operational detail came from onusoz's OpenClaw thread, which lays out a model picker, a /model command, and a controller that either loads the selected model into memory or throws an insufficient-memory error.

That post also breaks the local serving problem into concrete pieces:

  • switch models per channel
  • load and unload them automatically
  • compare runtimes like llama-swap, LM Studio, Ollama, and vLLM
  • benchmark weight formats and quants instead of treating one GGUF as representative
  • preserve enough headroom for multiple parallel generations

A parallel setup thread showed the same model moving into Pi-style agent workflows. badlogicgames' repost points to notes for running pi-coding-agent with local Qwen3.6 on an M4 Max 128GB, and another repost of the Pi plus llama.cpp guide signals that the setup was spreading beyond one user's experiment.

Memory floor and supervision

Y
Hacker News

Fresh discussion on Qwen3.6-35B-A3B: Agentic coding power, now open to all

1.2k upvotes · 523 comments

Community reports converged on one less glamorous point: Qwen3.6 looks much better when the hardware budget is not tight. In the April 17 HN delta, one commenter put useful low-bit offload runs at around 120GB RAM for a stronger experience, and the latest follow-up says 16GB is the extreme lower end.

The quality reports follow the same pattern. The LocalLLaMA post called it the first local coding model that consistently handled architecture, implementation, and debugging on that user's box, while mervenoyann's OpenClaw testing note said Qwen in Q6_K stayed surprisingly accurate but lost some of the character and friendliness other models kept.

Quant work is arriving quickly around that reality. danielhanchen's repost amplified a 2-bit variant that allegedly fit in 13 GB RAM, and UnslothAI's benchmark post focused on KL-divergence rankings across GGUF sizes rather than agent behavior.

Harness edge cases

r/LocalLLaMA

Qwen 3.6 CoT issue?

0 comments

Y
Hacker News

Fresh discussion on Qwen3.6-35B-A3B: Agentic coding power, now open to all

1.3k upvotes · 528 comments

Two of the more interesting caveats came from people building harnesses, not from benchmark charts. A LocalLLaMA post described Qwen occasionally ending a chain-of-thought block with an unexpected multi-token sequence, which broke a harness that was expecting a single delimiter token and surfaced as an API-style failure.

A separate concern in the April 18 HN delta was prompt injection inside agent loops. One commenter argued that untrusted text, including politically sensitive strings in commit messages, could destabilize outputs in ways that matter more for coding agents than for ordinary chat use.

Further reading

Discussion across the web

Where this story is being discussed, in original context.