Breaking · June 8, 2026

Apple claims 20B on-device model uses query-routed experts on iPhone 17 Pro

Apple said its most powerful on-device model runs on iPhone 17 Pro, while independent analysis describes a 20B design that routes a query to experts loaded from NAND into RAM. The architecture matters because it trades dense inference for hardware-aware expert selection, but access is constrained by device and region limits.

TL;DR

In its research post, Apple says AFM 3 Core Advanced is a 20-billion-parameter on-device model that activates only 1 to 4 billion parameters per request, while awnihannun's thread framed the design as a flash-backed expert system built for phone-class memory limits.
According to Apple's architecture writeup, the full model lives in NAND and a lightweight dense block picks which experts to load into DRAM, which is the core trick behind fitting a much larger model on device than normal dense inference would allow.
awnihannun's thread described the routing as a once-per-query decision, but Apple's own description says the model can periodically reselect experts during generation.
Availability is tight: a post by kimmonismus pointed to iPhone 17 Pro gating for the most powerful on-device model, and 9to5Mac's EU coverage reports that the new Siri AI rollout on iPhone is delayed in the EU.

You can read Apple's full model overview and skim the older Instruction-Following Pruning paper it cites. The WWDC Foundation Models session quietly adds two practical details: Apple rebuilt the on-device model from the ground up, and the framework now exposes context-size and token-count APIs so apps can adapt prompts to the hardware they are running on. testingcatalog's keynote thread also caught Apple presenting the new stack as a mix of Apple Foundation Models and Gemini-backed systems.

AFM 3 Core Advanced

Apple's research post names the new model AFM 3 Core Advanced and says it is natively multimodal, aimed at features like expressive voices and higher-accuracy dictation in the next Siri stack. In the same post, Apple separates it from AFM 3 Core, a 3 billion parameter dense model, which makes the 20B variant the interesting part rather than a blanket upgrade across every supported device.
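To put the 20B, 3B, and 1-to-4B figures side by side, here is a rough back-of-the-envelope sketch of weight storage. The parameter counts come from the article; the bits-per-weight values are assumptions chosen for illustration, since the article does not state how the weights are quantized.

```python
# Back-of-the-envelope weight footprints. Parameter counts come from the
# article; the bits-per-weight values are ASSUMPTIONS, not Apple figures.

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given parameter count."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


ASSUMED_BITS = (2, 4, 8)  # hypothetical quantization levels

MODELS = {
    "AFM 3 Core (3B dense)": 3,
    "AFM 3 Core Advanced, full (20B)": 20,
    "AFM 3 Core Advanced, active low (1B)": 1,
    "AFM 3 Core Advanced, active high (4B)": 4,
}


def footprint_table() -> dict:
    """Map each model label to {assumed bits: footprint in GiB}."""
    return {
        name: {bits: round(weight_gib(params, bits), 2) for bits in ASSUMED_BITS}
        for name, params in MODELS.items()
    }


if __name__ == "__main__":
    for name, row in footprint_table().items():
        cells = "  ".join(f"{bits}-bit: {gib:6.2f} GiB" for bits, gib in row.items())
        print(f"{name:40s} {cells}")
```

At an assumed 4 bits per weight, for example, the full 20B model works out to roughly 9.3 GiB of weights, while a 4B active slice is about 1.9 GiB. That gap is the kind of difference that would motivate keeping the full model in flash and only the active portion in DRAM.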

NAND and routed experts

The architectural move is simple enough to be memorable:

The full 20B model is stored in flash memory, not kept resident in DRAM, per Apple's model overview.
A lightweight dense block selects a fixed set of routed experts for the prompt, alongside always-active shared experts, per Apple's model overview.
Apple says this avoids token-by-token weight swapping, which would be too slow over NAND-to-DRAM bandwidth.
The official writeup adds one extra wrinkle that the early tweet summary flattened: experts are chosen during initial processing and can be periodically reselected during generation.

That last point matters because it makes the design less like a one-shot compile step and more like a coarse-grained routing system tuned around storage bandwidth.
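The routing behavior described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Apple's implementation: the expert counts, the reselection interval, the class and function names, and the random scoring rule are all hypothetical. It counts how many expert transfers from flash to DRAM occur under three policies: select once per query, reselect periodically, and reselect on every token.

```python
import random
from typing import Optional

# Toy simulation of coarse-grained expert routing. All names, sizes, and the
# scoring rule are HYPOTHETICAL; this is not Apple's implementation.

NUM_ROUTED_EXPERTS = 16  # assumed number of routed experts stored in flash
EXPERTS_IN_DRAM = 4      # assumed number of routed experts loaded at once
RESELECT_EVERY = 32      # assumed reselection interval, in generated tokens


class FlashBackedExperts:
    """Tracks which routed experts are resident in DRAM and counts loads."""

    def __init__(self) -> None:
        self.resident = set()
        self.loads = 0  # expert transfers from flash to DRAM

    def ensure_resident(self, wanted: set) -> None:
        missing = wanted - self.resident
        self.loads += len(missing)   # only missing experts cost a flash read
        self.resident = set(wanted)  # experts not selected are evicted


def select_experts(context: list, k: int = EXPERTS_IN_DRAM) -> set:
    """Stand-in for the lightweight dense block: score experts, keep top k."""
    rng = random.Random(sum(context))  # deterministic toy scoring
    scores = [rng.random() for _ in range(NUM_ROUTED_EXPERTS)]
    ranked = sorted(range(NUM_ROUTED_EXPERTS), key=lambda e: scores[e], reverse=True)
    return set(ranked[:k])


def generate(prompt: list, new_tokens: int, reselect_every: Optional[int]) -> int:
    """Run a toy decode loop and return the number of flash loads."""
    store = FlashBackedExperts()
    context = list(prompt)
    store.ensure_resident(select_experts(context))  # chosen during prompt processing
    for step in range(1, new_tokens + 1):
        context.append(step % 7)  # placeholder for a real decoded token
        if reselect_every and step % reselect_every == 0:
            store.ensure_resident(select_experts(context))  # periodic reselection
    return store.loads


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print("once per query:  ", generate(prompt, 128, None))
    print("every 32 tokens: ", generate(prompt, 128, RESELECT_EVERY))
    print("every token:     ", generate(prompt, 128, 1))
```

In this toy model, the once-per-query policy pays for exactly one set of expert loads, periodic reselection adds a bounded number more, and per-token reselection pays at least as many flash reads as the periodic policy. That ordering is the trade-off the article attributes to NAND-to-DRAM bandwidth.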

Device and region limits

The hardware gate surfaced immediately in community reaction. A post by kimmonismus called out Apple's claim that its most powerful on-device AI model runs on iPhone 17 Pro, while Apple's compatibility page for Apple Intelligence shows much broader baseline support than this specific top-end model tier implies.

The rollout is also uneven by region. 9to5Mac's report on EU availability says the new Siri AI experience on iPhone is not launching in the EU at the same time as in other regions, even as Apple's developer and newsroom materials pitch the new assistant as the headline product surface for these models.