Morph released FlashCompact, a specialized compaction model and SDK for coding agents, claiming 33k tokens per second and near-invisible long-context compression. Use it or copy the approach if compaction latency and noisy tool output are blocking longer agent runs.

There is also early stack-level adoption: an OpenClaw pull request integrates Morph as a compaction provider via /v1/compact, with fallback to LLM summarization and retry logic, though the review also flags some implementation bugs.

FlashCompact is not a general model repurposed for summarization. Morph describes it as "the first specialized model for context compaction," aimed at shortening agent state fast enough that compaction stops being the bottleneck in long coding runs (FlashCompact launch). In Morph's technical thread, the company says it trained a dedicated compaction model instead of relying on slower generic summarization flows, then served it on "a custom PyTriton based stack on H200."
The core claim is latency plus compression: "200k → 50k in ~1.5s" at 33,000 tokens per second (launch post). That matters because Morph argues current agent compaction often means "waiting 2+ minutes for terrible results," as its compaction complaint puts it. The company has also published a blog post and a playground for testing the approach.
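The two headline numbers are consistent with each other if the reported rate refers to output generation: producing the 50k-token compacted context at 33k tokens per second takes roughly 1.5 seconds. A quick sanity check:

```python
# Sanity-check Morph's latency claim: "200k → 50k in ~1.5s" at 33k tok/s.
# If 33,000 tok/sec is the *output* rate, the time is dominated by
# generating the 50k-token compacted context.
output_tokens = 50_000
tokens_per_second = 33_000
generation_time = output_tokens / tokens_per_second
print(f"{generation_time:.2f}s")  # ≈ 1.52s, matching the "~1.5s" claim
```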
Morph says it reviewed more than 200 agent sessions and over 40 coding-agent harnesses, and concluded that "most context bloat comes from tool responses, not model generation" (eval summary). That is a practical implementation detail for agent builders: the waste is allegedly in logs, command output, and tool chatter, not only in the model's own prior turns.
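This finding is easy to check against your own transcripts. A minimal sketch, assuming a simple role/content message format and a rough 4-characters-per-token estimate (both are illustrative assumptions, not Morph's methodology):

```python
# Estimate how much of a transcript's token budget each role consumes.
# The message schema and the ~4 chars/token heuristic are assumptions
# for illustration, not Morph's measurement method.
def token_share_by_role(messages):
    """Return each role's approximate share of total tokens."""
    totals = {}
    for m in messages:
        totals[m["role"]] = totals.get(m["role"], 0) + len(m["content"]) // 4
    grand = sum(totals.values()) or 1
    return {role: t / grand for role, t in totals.items()}

transcript = [
    {"role": "assistant", "content": "Running tests now."},
    {"role": "tool", "content": "FAIL test_foo ...\n" + "stack frame\n" * 200},
]
shares = token_share_by_role(transcript)
print(shares)  # tool output dominates this toy transcript
```

If tool messages dominate, targeted compaction of tool output (rather than whole-history summarization) is the higher-leverage change.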
The company's summary claim is that compaction caused "no performance drop" while reducing both token counts and step counts (eval summary). Its linked blog writeup describes two operating modes: objective-mode compaction that strips filler without task guidance, and query-based compaction that keeps or drops details based on the agent's next step.
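The two modes could be exposed as a single endpoint with an optional query. The sketch below builds a request for each mode; the /v1/compact path comes from the OpenClaw PR, but the field names ("mode", "query") are illustrative assumptions, not Morph's documented schema:

```python
import json

# Sketch of the two compaction modes Morph's writeup describes.
# Field names below ("mode", "query") are assumptions for illustration;
# only the /v1/compact path appears in the linked OpenClaw PR.
def build_compact_request(context, query=None):
    payload = {"input": context}
    if query is None:
        payload["mode"] = "objective"  # strip filler, no task guidance
    else:
        payload["mode"] = "query"      # keep details relevant to the next step
        payload["query"] = query
    return "/v1/compact", json.dumps(payload)

path, body = build_compact_request(
    "...long agent state...", query="fix failing test_foo"
)
print(path, json.loads(body)["mode"])  # /v1/compact query
```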
There is already some evidence of stack-level adoption. An OpenClaw pull request linked by Morph adds Morph as a compaction provider through /v1/compact, using a pre-compaction hook rather than replacing the whole flow (OpenClaw integration). The PR summary says the integration falls back to LLM summarization if Morph is unavailable, and adds exponential-backoff retries, but reviewers also noted a bug where non-retryable errors may still be retried and a quota-wasting edge case when a quality guard re-calls compaction on identical input (PR notes).
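The fallback-with-retry flow the PR describes can be sketched as follows, written to avoid the bug reviewers flagged: non-retryable errors should fall through to LLM summarization immediately rather than burn retries. All names here are illustrative; the actual OpenClaw code differs:

```python
import time

# Sketch of a compaction call with exponential-backoff retries and an
# LLM-summarization fallback, as described in the OpenClaw PR summary.
# Non-retryable errors (e.g. auth failures) skip straight to the fallback,
# avoiding the retry bug reviewers flagged. Names are hypothetical.
class NonRetryableError(Exception):
    pass

class RetryableError(Exception):
    pass

def compact_with_fallback(context, compact_fn, summarize_fn, retries=3):
    delay = 0.5
    for _ in range(retries):
        try:
            return compact_fn(context)
        except NonRetryableError:
            break                 # don't waste retries on e.g. 401s
        except RetryableError:
            time.sleep(delay)     # exponential backoff before retrying
            delay *= 2
    return summarize_fn(context)  # fall back to generic LLM summarization
```

A caller would pass the Morph client call as `compact_fn` and the existing summarization path as `summarize_fn`; the quota-wasting edge case the reviewers mention would additionally require caching results keyed on the input, which this sketch omits.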
Excerpts from Morph's launch thread:

> Introducing FlashCompact - the first specialized model for context compaction. 33k tokens/sec. 200k → 50k in ~1.5s. Fast, high quality compaction.

> So, we trained a specialized model for compaction and made it really fast - outputting at 33,000 tok/sec. We built on a custom PyTriton based stack on H200, using a similar inference stack as our FastApply model.

> We looked at 200+ agent sessions and over 40 of the top coding agent harnesses. Most context bloat comes from tool responses, not model generation. Result: no performance drop, fewer tokens, fewer steps.