Developers compare 128GB workstations, M5 Max laptops, and 20/80 local-cloud agent splits
Developers published new local-first agent setups spanning 128GB workstations, M5 Max laptops, local-model checkers, and 20/80 local-cloud splits. The pattern matters because teams are moving extraction, coordination, and offline tasks off frontier APIs while keeping harder reasoning in the cloud.

TL;DR
- pauliusztin_ framed the economic split cleanly: frontier APIs still power the "brain," but high-volume extraction and query work are getting pushed onto open models once workloads hit millions of documents.
- A lightweight local stack is getting easier to stand up: pauliusztin_'s post on llm-checker reduces hardware matching to three CLI commands, while ggerganov's repost of Clement Delangue points to llama.cpp's MTP support as the speed unlock behind "daily driver" local use.
- The most practical hybrid pattern in the evidence is imjaredz's 20% local, 80% cloud split, where a local CLI agent acts as coordinator and delegates bigger runs to cloud agents.
- Hardware expectations have moved up fast: onusoz called 128 GB RAM the new sweet spot for parallel agent work, while bridgemindai's M5 Max post claimed 30B-class local models were running faster than two DGX Sparks in that specific setup.
- The interesting part is not just inference on a laptop. MrAddams_LibraLogic's HuBrIS post, karanb192's hooks writeup, and DifficultDog8435's SEELS launch all push agent memory, safety policy, and even LoRA training into local or partly local workflows.
You can inspect Agent FM's repo, read karanb192's longer hooks post, browse OpenCode, and even poke at a Python agent that inserts an expensive reviewer only at decision points via ClawCodex. The weird bit is how many of these projects treat local models less like a full replacement and more like a persistent control plane: one box stays on, remembers context, intercepts tool calls, or supervises cheaper workers while cloud models handle the expensive judgment calls.
128 GB boxes
The hardware bar in this sample is not "can it run," it is "can it run several things at once without turning into a science project."
The setup onusoz described is a permanent home workstation in the $3,000 to $5,000 range, plus a weaker laptop or phone for SSH and mosh access. The reason given was parallelism, not vanity: more simultaneous agent work means RAM becomes the bottleneck before portability does.
That lines up with bridgemindai's first M5 Max impression, which claimed Qwen 3.6 35B and Gemma 4 31B were "blazing fast" on a 128 GB MacBook Pro and, in that early test, faster than two stacked DGX Sparks. Clement Delangue's repost of MiMo V2.5-Coder pushed the same memory threshold from another angle, calling 128 GB RAM enough to run one of the best coding models locally.
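The arithmetic behind a memory threshold like this can be sketched roughly. The snippet below estimates weight memory only, reading the 35B and 31B parameter counts from the model names in the post. The quantization widths and the 16 GB headroom are assumptions for illustration, and KV cache, activations, and runtime overhead are deliberately ignored, so real usage will be higher.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits_in_ram(models, ram_gb: float, reserve_gb: float = 16.0) -> bool:
    """True if the weights of all (params_billions, bits) pairs fit at once.

    reserve_gb is an assumed allowance for the OS, KV cache, and other apps.
    """
    needed = sum(weight_footprint_gb(p, b) for p, b in models)
    return needed <= ram_gb - reserve_gb


if __name__ == "__main__":
    # Assumed 4-bit quantization for both models named in the post.
    both_4bit = [(35, 4), (31, 4)]
    print(weight_footprint_gb(35, 4))          # weights for a 35B model at 4-bit
    print(fits_in_ram(both_4bit, ram_gb=128))  # both resident on a 128 GB box
```

Under these assumptions, two 30B-class models at 4-bit need about 33 GB for weights, which leaves room for several concurrent contexts on a 128 GB machine. The same pair at 16-bit would not fit.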
A countercurrent showed up too. thdxr argued that there are many reasons to run local models, but cost probably is not one of them.
20/80 orchestration
The clearest workflow change here is not full local autonomy. It is local orchestration.
Three patterns repeat across the evidence:
- Keep repetitive, cheap, or stateful work local, like extraction, query, or session coordination.
- Hand harder reasoning to cloud models, especially when architectural judgment matters more than raw tool throughput.
- Use the local process as the master process that launches cloud runs, collects outputs, and merges results back into the working branch.
That is almost exactly how imjaredz described the split: a local CLI agent as "master coordinator," kicking off cloud agents and folding results back in one by one. pauliusztin_ made the economic version of the same point, arguing that once ingestion scales to a million documents, even cheap API extraction becomes hard to justify for an independent engineer.
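A minimal sketch of that coordinator shape follows. It is not imjaredz's code. The task kinds, the `LOCAL_KINDS` set, and the two runner callables are hypothetical stand-ins for whatever local model and cloud agents a real setup would call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    name: str
    kind: str  # hypothetical labels, e.g. "extract", "query", "design"


# Assumed policy: cheap, repetitive, or stateful kinds stay local.
LOCAL_KINDS = {"extract", "query", "coordinate"}


def route(task: Task) -> str:
    """Decide where a task runs based on its kind."""
    return "local" if task.kind in LOCAL_KINDS else "cloud"


def run_all(
    tasks: List[Task],
    run_local: Callable[[Task], str],
    run_cloud: Callable[[Task], str],
) -> List[Tuple[str, str, str]]:
    """Local master loop: dispatch each task, then collect results in order."""
    results = []
    for task in tasks:
        target = route(task)
        runner = run_local if target == "local" else run_cloud
        results.append((task.name, target, runner(task)))
    return results
```

Here the coordinator folds results back one at a time, in task order. A real setup would likely launch the cloud runs concurrently and merge them as they finish.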
A Reddit build in the same vein, Icy-Routine242's ClawCodex post, uses a cheap worker for reads, edits, and tests, then pauses to consult a stronger reviewer only at decision points. The repo is public at ClawCodex.
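The "reviewer only at decision points" idea can be sketched as a loop that gates calls to the expensive model behind a predicate. This is an illustration of the pattern as described in the post, not ClawCodex's implementation. `worker`, `reviewer`, and `is_decision_point` are hypothetical callables.

```python
from typing import Callable, List, Tuple


def run_with_reviewer(
    steps: List[str],
    worker: Callable[[str], str],
    reviewer: Callable[[str, str], str],
    is_decision_point: Callable[[str], bool],
) -> List[Tuple[str, str, str]]:
    """Run every step on the cheap worker; consult the reviewer only when flagged."""
    log = []
    for step in steps:
        proposal = worker(step)  # cheap model handles reads, edits, tests
        if is_decision_point(step):
            # Pause and pay for the stronger model only here.
            final = reviewer(step, proposal)
            log.append((step, "reviewer", final))
        else:
            log.append((step, "worker", proposal))
    return log
```

The cost profile depends almost entirely on how `is_decision_point` is defined, which the sketch leaves to the caller.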
Harnesses and remote control
Once local models become one node in a larger system, the harness matters as much as the model.
According to the HN summary of OpenCode, practitioners were using OpenCode as a multi-provider harness for llama.cpp, Claude, and Gemini, then extending it with plugins, memory tools, and remote access over Tailscale. The comments called out five concrete traits that made it interesting:
- Multi-provider backends in one interface
- Local model integration
- Remote control through opencode serve and web access
- Plugin-based extensions over IPC
- Known caveats around telemetry, forced proxying, and resource usage
The official project link in the HN item points to OpenCode.
Agent FM takes the same multi-agent reality from the monitoring side. Gold-Juice-6798 built it to narrate Claude Code and Codex sessions, including remote workspaces over SSH, because once six to ten agents are running in parallel the bottleneck becomes tracking which one is blocked or drifting. The app and repo are public at agentfm.ai and GitHub.
Memory and guardrails
The local-first story is broadening from inference into runtime state and policy enforcement.
HuBrIS, in MrAddams_LibraLogic's LocalLLaMA post, splits memory into two layers: semantic memory for facts and skills, and autobiographical memory for ordered session history. The interesting mechanics are the pruning rules around that memory layer:
- "Dross" removal for zero-value filler
- Subject tagging and dormancy when a topic stops mattering
- Protected key info that a watcher restores if compaction drops it
- Recall tools that can pull structured memory back into context from old sessions
The post also makes the tradeoff explicit. A second metacognitive layer keeps local inference running between turns and makes memory quality dependent on the model handling those background decisions.
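The pruning rules can be sketched as a small compaction pass plus a watcher. This is an illustration of the rules as the post lists them, not HuBrIS code. The `Entry` fields, the `is_dross` predicate, and the set of active subjects are hypothetical, and in the real system a local model would presumably make those judgments.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Entry:
    text: str
    subject: str
    protected: bool = False


def compact(
    entries: List[Entry],
    active_subjects: Set[str],
    is_dross: Callable[[Entry], bool],
) -> Tuple[List[Entry], List[Entry]]:
    """Split memory into (kept, dormant); zero-value filler is dropped."""
    kept, dormant = [], []
    for e in entries:
        if e.protected:
            kept.append(e)      # protected key info always survives
        elif is_dross(e):
            continue            # "dross" removal
        elif e.subject in active_subjects:
            kept.append(e)
        else:
            dormant.append(e)   # topic stopped mattering: park it, do not delete
    return kept, dormant


def restore_protected(original: List[Entry], compacted: List[Entry]) -> List[Entry]:
    """Watcher: re-add protected entries that a lossy compaction dropped."""
    missing = [e for e in original if e.protected and e not in compacted]
    return compacted + missing
```

Dormant entries stay available for later recall, which is what separates dormancy from dross removal in this sketch.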
On the safety side, karanb192's Claude Code hooks post describes a PreToolUse and PostToolUse interceptor pattern where every tool call is checked outside the model's view. The claimed OWASP-shaped coverage included prompt injection patterns, sensitive file reads, output scanning, excessive agency, system prompt leakage, and tool-call rate limits, with synchronous checks in the 50 to 100 ms range. The code is open at claude-code-hooks.
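A minimal sketch of the interceptor idea, assuming the hook hands the checker a tool name and an input dictionary. The regex patterns, the `Read` tool name, the `file_path` key, and the function names are illustrative assumptions, not the contents of claude-code-hooks, and the wiring into Claude Code's hook configuration is not shown.

```python
import re
from typing import Dict, Tuple

# Illustrative patterns only; a real policy would be far broader.
SENSITIVE_PATHS = re.compile(r"(^|/)(\.env|id_rsa|\.aws/credentials)$")
INJECTION_HINTS = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


def pre_tool_use(tool_name: str, tool_input: Dict[str, object]) -> Tuple[bool, str]:
    """Check a tool call before it runs. Returns (allowed, reason)."""
    path = str(tool_input.get("file_path", ""))
    if tool_name == "Read" and SENSITIVE_PATHS.search(path):
        return False, f"blocked sensitive read: {path}"
    return True, "ok"


def post_tool_use(output: str) -> Tuple[bool, str]:
    """Scan tool output before it is returned to the model."""
    if INJECTION_HINTS.search(output):
        return False, "possible prompt injection in tool output"
    return True, "ok"
```

Because both checks are plain regex scans that run outside the model, they add no inference cost. The 50 to 100 ms latency figure above is the post's own claim, not something this sketch measures.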
Local products keep getting weirder
The final pattern is how quickly local tooling is absorbing features that used to belong to cloud apps, including training loops and mobile multimodal assistants.
SEELS turns user corrections into a JSONL corpus and then kicks off a local PEFT LoRA run from a desktop app, with no notebook or terminal required, according to DifficultDog8435. The same post says the Windows installer already bundles CUDA runtime and a portable Python sidecar, which is a very different ambition from a thin Ollama wrapper.
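The correction-to-corpus step can be sketched in a few lines. The record fields (`prompt`, `rejected`, `completion`) are hypothetical. The post does not describe SEELS's actual schema, and the LoRA training run itself is out of scope here.

```python
import json
from typing import Dict, Iterable


def correction_to_record(prompt: str, rejected: str, corrected: str) -> Dict[str, str]:
    """Turn one user correction into a training record (hypothetical schema)."""
    return {"prompt": prompt, "rejected": rejected, "completion": corrected}


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


if __name__ == "__main__":
    rec = correction_to_record(
        "Summarize the ticket.", "It is a bug.", "Login fails after password reset."
    )
    print(to_jsonl([rec]), end="")
```

Keeping the rejected answer next to the corrected one is a design choice in this sketch. It would let the same file feed either plain supervised fine-tuning or a preference-style objective.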
On mobile, Ok-Yak7397 shipped an Android assistant using Gemma 4 with offline voice, OCR, camera input, and support for custom LiteRT models. The use cases in the post were not benchmark bait. They were document analysis, travel without connectivity, and hands-free voice interaction with zero network access.
That puts the comparison in a sharper frame. The "local versus cloud" debate in this evidence pool is mostly over. What developers are actually building is a stack where local runtimes own coordination, memory, privacy-sensitive IO, and sometimes fine-tuning, while frontier APIs keep the hardest reasoning tasks.