oMLX supports Claude Code locally with tiered KV cache and Anthropic Messages API
oMLX now supports local Claude Code setups on Apple Silicon with tiered KV cache and an Anthropic Messages API-compatible endpoint, with one setup reporting roughly 10× faster performance than mlx_lm-style serving. If you want private on-device coding agents, point Claude Code at a local Messages API-compatible endpoint and disable the attribution header to preserve cache reuse.

TL;DR
- A practitioner setup shows Claude Code can target a fully local backend on Apple Silicon by pointing it at any server that speaks the Anthropic Messages API instead of Anthropic’s hosted endpoint (local backend setup).
- The reported speedup came from swapping the inference layer, not the model: according to the caching breakdown, oMLX restored prefix reuse with “tiered KV caching and continuous batching,” and the user reports “~10× faster” behavior than earlier attempts (speed claim thread).
- Claude Code’s default attribution header can break cache consistency in this setup; the workaround in the config notes is to disable it with CLAUDE_CODE_ATTRIBUTION_HEADER=0 so repeated prompts keep hitting cache.
- Model fit still depends on local hardware. In the model-fit note, the same user points to a compatibility tool and says a Mac Studio was steered toward Qwen3.5 9B for a realistic local deployment.

How does Claude Code run against a local server?
Claude Code does not need a special local-only integration here. The key detail from the setup thread is that it will send requests to “any backend that implements the Anthropic Messages API,” with ANTHROPIC_BASE_URL redirected to a localhost endpoint. That makes the integration surface fairly simple for local serving stacks: if they mimic the Messages API, Claude Code can sit on top.
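As a concrete sketch, such a server can be smoke-tested with a plain HTTP request before Claude Code is wired to it. The request path and JSON shape below follow the Anthropic Messages API; the localhost port and model name are assumptions for illustration, not values from the thread.

```sh
# Smoke-test a local Messages API endpoint (port and model name are assumptions).
curl -s http://localhost:8080/v1/messages \
  -H 'content-type: application/json' \
  -H 'anthropic-version: 2023-06-01' \
  -d '{
        "model": "local-model",
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```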
The same thread adds a deployment-specific gotcha. Claude Code’s attribution header “breaks prefix consistency and invalidates the KV cache,” so this setup disables it with CLAUDE_CODE_ATTRIBUTION_HEADER=0 (header workaround). That detail matters because the local-agent story here is not just privacy or zero API cost; it is whether the request stream stays cache-friendly enough to keep interactive coding latency down.
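Putting both details together, a minimal launch sequence might look like the following. The two environment variables are the ones named in the thread; the endpoint URL is an assumption.

```sh
# Redirect Claude Code to the local server (URL/port are assumptions).
export ANTHROPIC_BASE_URL=http://localhost:8080
# Disable the attribution header so request prefixes stay byte-identical
# across turns and keep hitting the KV cache (per the thread's workaround).
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
claude
```
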
Why did oMLX speed it up?
The reported bottleneck was prefill, not raw model quality. In the technical explanation, the user says mlx_lm was not reusing KV cache, so each request had to rerun the full prefill even when the system prompt stayed fixed. After switching to oMLX, which is described there and in the repo as an Apple Silicon inference server with persistent tiered KV caching and continuous batching, “most tokens are now served directly from cache.”
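To see why a stable prefix matters, consider two requests sharing the same system prompt: a prefix-caching server can serve the KV entries for the shared leading tokens from cache instead of recomputing them during prefill. This is an illustrative sketch, not oMLX’s documented behavior; the endpoint, port, and model name are assumptions.

```sh
# Two requests with an identical system prompt; a prefix-caching server
# can reuse the cached KV entries for the shared leading tokens.
SYSTEM='You are a careful coding assistant.'
for PROMPT in 'Summarize this repo.' 'Summarize this repo, then list risks.'; do
  curl -s http://localhost:8080/v1/messages \
    -H 'content-type: application/json' \
    -d "{\"model\": \"local-model\", \"max_tokens\": 32,
         \"system\": \"$SYSTEM\",
         \"messages\": [{\"role\": \"user\", \"content\": \"$PROMPT\"}]}"
done
```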
That cache behavior is the basis for the claimed “~10× faster” result in this single setup (speed claim thread). The thread also points to a hardware-fit tool for choosing a model your machine can actually sustain, with Qwen3.5 9B cited as the recommendation for one Mac Studio configuration (model recommendation).