oMLX now supports local Claude Code setups on Apple Silicon with tiered KV cache and an Anthropic Messages API-compatible endpoint, with one setup reporting roughly 10x faster performance than mlx_lm-style serving. If you want private on-device coding agents, point Claude Code at a local compatible endpoint and disable the attribution header to preserve cache reuse.

CLAUDE_CODE_ATTRIBUTION_HEADER=0 so repeated prompts keep hitting cache.Claude Code does not need a special local-only integration here. The key detail from the setup thread is that it will send requests to “any backend that implements the Anthropic Messages API,” with ANTHROPIC_BASE_URL redirected to a localhost endpoint. That makes the integration surface fairly simple for local serving stacks: if they mimic the Messages API, Claude Code can sit on top.
The same thread adds a deployment-specific gotcha. Claude Code’s attribution header “breaks prefix consistency and invalidates the KV cache,” so this setup disables it with CLAUDE_CODE_ATTRIBUTION_HEADER=0 header workaround. That detail matters because the local-agent story here is not just privacy or zero API cost; it is whether the request stream stays cache-friendly enough to keep interactive coding latency down.
The reported bottleneck was prefill, not raw model quality. In the technical explanation, the user says mlx_lm was not reusing KV cache, so each request had to rerun the full prefill even when the system prompt stayed fixed. After switching to oMLX, which is described there and in the repo as an Apple Silicon inference server with persistent tiered KV caching and continuous batching, “most tokens are now served directly from cache.”
That is the basis for the claimed “~10× faster” result in this single setup speed claim thread. The thread also points to a hardware-fit tool for choosing a model your machine can actually sustain, with Qwen3.5 9B cited as the recommendation for one Mac Studio configuration model recommendation.
Anthropic is testing a new /init flow that interviews users and configures Claude.md, hooks, and skills in new or existing repos. Try it in a sandbox repo, then watch for skills behavior differences between chat and web surfaces.
releaseOpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
releaseCursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
breakingChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
breakingEpoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
3/ So how does Claude Code run with local models? It just sends requests to any backend that implements the Anthropic Messages API. By setting: ANTHROPIC_BASE_URL=http://localhost:8000 Claude Code routes requests to your local inference server instead of Anthropic. One more Show more
2/ The real bottleneck wasn’t the model. It was the inference layer. With mlx_lm, the KV cache wasn’t reused. Every request had to redo the full prefill, even when the system prompt stayed the same. Switching to oMLX fixed this. It’s an inference server optimized for Apple Show more