
Morph launches FlashCompact: 33k tok/s compaction from 200k to 50k in 1.5s

Morph released FlashCompact, a specialized compaction model and SDK for coding agents, claiming 33k tokens per second and near-invisible long-context compression. Use it or copy the approach if compaction latency and noisy tool output are blocking longer agent runs.


TL;DR

  • Morph launched FlashCompact, a specialized context-compaction model for coding agents, and says it runs at “33k tokens/sec,” shrinking 200k tokens to 50k in about 1.5 seconds, according to the launch post.
  • In Morph’s technical thread, the company says the model runs on a custom PyTriton stack on H200 GPUs and targets a pain point where compaction in agents can take “2+ minutes” with poor results, as shown in the companion post.
  • Morph says its analysis of “200+ agent sessions” and “40 of the top coding agent harnesses” found that most context bloat comes from tool responses rather than model generation, and that compaction delivered “no performance drop” alongside fewer tokens and fewer steps, per the eval summary.
  • Early ecosystem integration is already starting: an OpenClaw PR adds Morph as a compaction provider via /v1/compact, with fallback to LLM summarization and retry logic, though the review also flags some implementation bugs.

What shipped, and why Morph built a dedicated compaction model

FlashCompact is not a general model repurposed for summarization. Morph describes it as “the first specialized model for context compaction,” aimed at shortening agent state fast enough that compaction stops being the bottleneck in long coding runs (FlashCompact launch). In Morph’s technical thread, the company says it trained a dedicated compaction model instead of relying on slower generic summarization flows, then served it on “a custom PyTriton based stack on H200.”
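Morph has not published serving code, so the following is only a minimal sketch of what binding a model to NVIDIA’s PyTriton server looks like; the model name, tensor shapes, and the body of compact_fn are hypothetical stand-ins, not FlashCompact’s actual stack.

```python
# Minimal PyTriton binding sketch. Everything inside compact_fn is a
# hypothetical stand-in; a real deployment would run GPU inference here.
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def compact_fn(context):
    # `context` arrives as a (batch, 1) array of UTF-8 bytes.
    # Placeholder logic: truncate each transcript; the real model
    # would generate a compacted rewrite instead.
    compacted = np.array([[row[0][:50_000]] for row in context], dtype=object)
    return {"compacted": compacted}


with Triton() as triton:
    triton.bind(
        model_name="flashcompact",  # hypothetical model name
        infer_func=compact_fn,
        inputs=[Tensor(name="context", dtype=bytes, shape=(1,))],
        outputs=[Tensor(name="compacted", dtype=bytes, shape=(1,))],
        config=ModelConfig(max_batch_size=8),
    )
    triton.serve()  # blocks and serves the HTTP/gRPC endpoints
```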

The core claim is latency plus compression: “200k → 50k in ~1.5s” at 33,000 tokens per second (launch post). Read together, the throughput figure appears to describe output tokens: 50,000 tokens in ~1.5 seconds works out to roughly 33k tokens/sec. That matters because Morph argues current agent compaction often means “waiting 2+ minutes for terrible results” (compaction complaint). The company has also published a blog post and a playground for testing the approach.

What the technical evidence says about agent context bloat

Morph says it reviewed more than 200 agent sessions and over 40 coding-agent harnesses, and concluded that “most context bloat comes from tool responses, not model generation” (eval summary). That is a practical finding for agent builders: the waste is allegedly in logs, command output, and tool chatter, not only in the model’s own prior turns.
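Morph’s SDK is the packaged answer, but the finding itself is actionable without it. Below is a hedged sketch of the cheapest version of the idea, a pre-model filter that truncates oversized tool responses before they enter context; the message shape and the threshold are assumptions for illustration, not Morph’s implementation.

```python
# Illustrative pre-model filter for tool-response bloat. The message
# shape ({"role", "content"}) and the threshold are assumptions.
MAX_TOOL_CHARS = 4_000  # keep the head and tail of oversized tool output


def trim_tool_message(msg: dict) -> dict:
    if msg.get("role") != "tool":
        return msg  # leave user and model turns untouched
    text = msg.get("content", "")
    if len(text) <= MAX_TOOL_CHARS:
        return msg
    half = MAX_TOOL_CHARS // 2
    elided = len(text) - MAX_TOOL_CHARS
    trimmed = f"{text[:half]}\n... [{elided} chars elided] ...\n{text[-half:]}"
    return {**msg, "content": trimmed}


def trim_transcript(messages: list[dict]) -> list[dict]:
    return [trim_tool_message(m) for m in messages]
```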

The company’s summary claim is that compaction caused “no performance drop” while reducing both token counts and step counts (eval summary). Its linked blog writeup describes two operating modes: objective-mode compaction that strips filler without task guidance, and query-based compaction that keeps or drops details based on the agent’s next step.
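The launch thread does not publish a request schema for the two modes, so the sketch below is an assumption-heavy illustration: the /v1/compact path comes from the OpenClaw PR discussed next, while the base URL and the JSON field names ("context", "query") are invented for the example.

```python
# Hedged illustration of the two modes against a /v1/compact endpoint.
# The base URL and the "context"/"query" field names are assumptions.
import requests

BASE_URL = "https://compaction.example.com"  # placeholder, not a real endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
transcript_text = "...full agent transcript, tool output included..."

# Objective-mode compaction: strip filler with no task guidance.
objective = requests.post(
    f"{BASE_URL}/v1/compact",
    headers=HEADERS,
    json={"context": transcript_text},
    timeout=30,
)

# Query-based compaction: keep or drop details relative to the next step.
query_based = requests.post(
    f"{BASE_URL}/v1/compact",
    headers=HEADERS,
    json={
        "context": transcript_text,
        "query": "Fix the failing test in tests/test_parser.py",
    },
    timeout=30,
)
```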

There is already some evidence of stack-level adoption. An OpenClaw pull request linked by Morph adds Morph as a compaction provider through /v1/compact, using a pre-compaction hook rather than replacing the whole flow (OpenClaw integration). The PR summary says the integration falls back to LLM summarization if Morph is unavailable and adds exponential-backoff retries. Reviewers, however, flagged two bugs: non-retryable errors may still be retried, and a quality guard can re-call compaction on identical input, wasting quota (PR notes).
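The PR’s actual code is not quoted in the thread, but the two review findings map onto a standard retry-plus-guard pattern. Here is a hedged sketch of what a corrected wrapper could look like; the error classes, the compact_once callable, and the summarize_with_llm fallback are all hypothetical.

```python
# Hypothetical retry wrapper addressing both review findings:
# (1) do not retry non-retryable errors, (2) do not re-call compaction
# on byte-identical input. Error classes and helpers are invented.
import time


class CompactionError(Exception):
    retryable = False  # e.g. auth failures, malformed requests


class TransientCompactionError(CompactionError):
    retryable = True  # e.g. timeouts, 5xx responses


def compact_with_retry(compact_once, text, max_attempts=3, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return compact_once(text)
        except CompactionError as err:
            if not err.retryable or attempt == max_attempts - 1:
                raise  # fail fast on permanent errors (first review finding)
            time.sleep(base_delay * 2**attempt)  # exponential backoff


_cache: dict[int, str] = {}  # guards the identical-input edge case


def compact_guarded(compact_once, summarize_with_llm, text):
    key = hash(text)
    if key not in _cache:  # second review finding: skip duplicate calls
        try:
            _cache[key] = compact_with_retry(compact_once, text)
        except CompactionError:
            return summarize_with_llm(text)  # documented fallback path
    return _cache[key]
```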
