updateApril 19, 2026

Qwen3.6-35B-A3B benchmarks 40 tok/s on M3 Ultra with Strix Halo follow-ups

Fresh local reports put Qwen3.6-35B-A3B around 40 tok/s on M3 Ultra, extended testing to Strix Halo, and wired it into OpenClaw and Pi-style harnesses. The update matters because Qwen3.6 is moving from quant benchmarks into real local coding-agent loops with clearer hardware limits.

4 min read

Qwen3.6-35B-A3B benchmarks 40 tok/s on M3 Ultra with Strix Halo follow-ups

TL;DR

The main Hacker News thread turned into a practical benchmark log for Qwen3.6-35B-A3B, with one early hands-on report clocking about 40 tok/s on an M3 Ultra Mac Studio and saying tool use stayed stable past 100k tokens.
Fresh follow-ups in the latest HN delta pushed the conversation from first-run excitement into hardware fit, with Strix Halo tests surfacing alongside a more skeptical note that 16GB is the extreme lower end for useful coding runs.
Local harness work showed up fast: onusoz's OpenClaw post describes channel-level model switching and memory-aware loading, while badlogicgames' repost of a Pi plus llama.cpp setup points to the same model getting wired into agent shells rather than benchmark scripts.
The performance story is not uniform, because a LocalLLM first-impressions post reported just 11.3 tok/s with tool use on a MacBook Pro M5 64GB, while a LocalLLaMA user running a 40 GB VRAM setup said a Q4 XL quant felt like the first genuinely usable local coding model they had tried.

You can read the official Qwen release post, skim the main HN thread, and compare that against community reports from LocalLLaMA and localLLM. There is also an OpenClaw model-switching post and a Pi plus llama.cpp setup repost, which is where this stops looking like a launch chart and starts looking like a local agent stack.

M3 Ultra and Strix Halo

Qwen3.6-35B-A3B: Agentic coding power, now open to all

Useful as a practical benchmark thread: people are testing Qwen3.6-35B-A3B in local agent loops, comparing runtimes/quantizations, and discussing the memory and speed envelope needed for coding workflows.

Fresh discussion on Qwen3.6-35B-A3B: Agentic coding power, now open to all

Today’s new signal is mostly about where the model sits on practical hardware and how much supervision it needs. One commenter says 16GB appears to be the extreme lower end for getting useful results, likening it to a junior developer that still needs a tight code-test-fix loop. Another fresh note is that people are now testing it on AMD Strix Halo systems and looking at performance claims from different inference stacks, suggesting the discussion has moved further into real deployment and backend tuning rather than just initial curiosity.

The strongest recurring number is still the M3 Ultra report. In the HN discussion highlights, qazplm17 said Qwen3.6 ran at about 40 tok/s on a Mac Studio and kept tool use coherent even after 100k tokens.

The newer signal is where else people are trying it. The latest HN delta says Strix Halo results were coming in line with expectations, which shifts the conversation from one standout Apple Silicon run to backend and hardware tuning across platforms.

That range matters because the thread is not describing a single clean envelope. The MacBook Pro M5 post said tool use felt sluggish at 11.3 tok/s, while the LocalLLaMA report described 50 to 60 tok/s with filled context on a 40 GB VRAM setup using Open Code and extra MCPs.

Harnesses and model switching

The most useful operational detail came from onusoz's OpenClaw thread, which lays out a model picker, a /model command, and a controller that either loads the selected model into memory or throws an insufficient-memory error.

That post also breaks the local serving problem into concrete pieces:

switch models per channel
load and unload them automatically
compare runtimes like llama-swap, LM Studio, Ollama, and vLLM
benchmark weight formats and quants instead of treating one GGUF as representative
preserve enough headroom for multiple parallel generations

A parallel setup thread showed the same model moving into Pi-style agent workflows. badlogicgames' repost points to notes for running pi-coding-agent with local Qwen3.6 on an M4 Max 128GB, and another repost of the Pi plus llama.cpp guide signals that the setup was spreading beyond one user's experiment.

Memory floor and supervision

Fresh discussion on Qwen3.6-35B-A3B: Agentic coding power, now open to all

Today’s new signal is mostly practical validation from users trying the model in live agentic workflows. One commenter reports running Qwen3.6-35B-A3B BF16 with omlx on an M3 Ultra Mac Studio at roughly 40 tokens/sec and says tool use stayed stable even over 100k tokens, calling it their smoothest local agent session so far. Another says it is the most capable local model they have tried at this size, especially for breaking tasks into steps and asking clarifying questions. The other fresh theme is real-world deployability: people ask what hardware it needs, whether it is available through aggregators, and how it compares with Opus-class models on a MacBook M3 Max. One reply suggests a low-bit quant with partial offload can fit on high-end consumer rigs and still run around 20 tps, while another points to quantization/fork issues as a bottleneck for local model users. There is also renewed skepticism that benchmark gains map to agentic usefulness, with a few commenters calling out “benchmaxxing.”

Community reports converged on one less glamorous point: Qwen3.6 looks much better when the hardware budget is not tight. In the April 17 HN delta, one commenter put useful low-bit offload runs at around 120GB RAM for a stronger experience, and the latest follow-up says 16GB is the extreme lower end.

The quality reports follow the same pattern. The LocalLLaMA post called it the first local coding model that consistently handled architecture, implementation, and debugging on that user's box, while mervenoyann's OpenClaw testing note said Qwen in Q6_K stayed surprisingly accurate but lost some of the character and friendliness other models kept.

Quant work is arriving quickly around that reality. danielhanchen's repost amplified a 2-bit variant that allegedly fit in 13 GB RAM, and UnslothAI's benchmark post focused on KL-divergence rankings across GGUF sizes rather than agent behavior.

Harness edge cases

r/LocalLLaMA

Qwen 3.6 CoT issue?

0 comments

Fresh discussion on Qwen3.6-35B-A3B: Agentic coding power, now open to all

The main genuinely new signal today is a concrete security concern raised for agentic use: one commenter warns that prompt-injection-style attacks could destabilize the model, especially if an attacker can steer it with politically sensitive Chinese topics. Their concern is not just refusal behavior, but that the model may answer with outright falsehoods, which could be risky in workflows where it reads code, commit messages, or other untrusted text. The rest of today’s discussion is still centered on hands-on validation: one user reports a very smooth local agent session at around 40 tok/s on an M3 Ultra Mac Studio with stability over 100k tokens, reinforcing the “this is usable locally” theme rather than changing it.

Two of the more interesting caveats came from people building harnesses, not from benchmark charts. A LocalLLaMA post described Qwen occasionally ending a chain-of-thought block with an unexpected multi-token sequence, which broke a harness that was expecting a single delimiter token and surfaced as an API-style failure.

A separate concern in the April 18 HN delta was prompt injection inside agent loops. One commenter argued that untrusted text, including politically sensitive strings in commit messages, could destabilize outputs in ways that matter more for coding agents than for ordinary chat use.

Discussion across the web

Where this story is being discussed, in original context.

On X· 2 threads

Harnesses and model switching2 posts

Memory floor and supervision2 posts

On Hacker News· 1 post

Discussion around Qwen3.6-35B-A3B: Agentic coding power, now open to all1.3k upvotes · 531 comments

On Reddit· 2 posts

r/LocalLLaMA

Trie the new Qwen-3.6-35B-A3B if you can fit it into VRAM

1 comments

r/localLLM

Running Qwen 3.6 35B-A3B-4b on MacBook Pro M5 64GB - first impressions

0 comments