releaseJune 14, 2026

OpenRouter launches Fusion API with model panels and judge routing

OpenRouter launched Fusion, a server-side panel API that sends prompts to multiple models and combines one answer. Early logs also showed a web-path issue where Fusion still invoked Claude Opus 4.8 as judge and billed for it until API-side control was clarified.

7 min read

OpenRouter launches Fusion API with model panels and judge routing

TL;DR

OpenRouter's new Fusion announcement says one API call can fan a prompt out to multiple models, have a judge extract consensus and contradictions, and return a single answer, a setup that kimmonismus's summary and the Fusion docs both describe as a server-side compound model.
On OpenRouter's DRACO run, the company says a budget panel of Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro came within 1 percent of Fable 5 while beating solo GPT-5.5 and solo Opus 4.8, according to the benchmark post and WesRoth's recap.
Early user logs also showed a messier web launch: teortaxesTex's report said Fusion still invoked Opus 4.8 as judge even with cheap open models selected, while teortaxesTex's later update said arbitrary model configs worked over the API and the problem appeared to be a web-path failure.
The control surface is broader than the launch thread implies: the server-tool docs expose custom analysis_models, a separate judge model, tool_choice: "required", degraded-judge fallback behavior, and up to eight panel models, details echoed by WesRoth's description of custom panels.
The launch also landed as a live argument for mixture-of-agents designs: Jerry Liu framed it as proof that cost and accuracy can come from orchestration rather than one frontier model, while Teknium's Hermes PR note pointed to an in-flight /moa implementation in Hermes Agent.

You can read OpenRouter's full benchmark post, inspect the server-tool docs, browse the public Fusion model page, and compare that official framing with the user billing complaint and Teknium's linked Hermes /moa pull request. The weird bit is that OpenRouter shipped both a strong benchmark story and a same-week clarification that the scary Opus judge behavior seemed limited to the web path, not the API.

Fusion

OpenRouter is packaging Fusion three ways: as the openrouter/fusion model slug, as an openrouter:fusion server tool, and as a plugin config surface in normal completions calls, according to the plugin docs and the launch FAQ.

The pipeline is more structured than a vague ensemble call:

A panel of 1 to 8 models answers in parallel.
Each panel model gets openrouter:web_search and openrouter:web_fetch.
A judge model returns JSON for consensus, contradictions, partial coverage, unique insights, and blind spots.
The outer model writes the final answer from that analysis.

That structure matters because the judge is not a simple voter. The server-tool docs say it can degrade gracefully too: if the panel succeeds but the judge fails, Fusion can still return raw panel responses instead of hard-failing the whole call.

DRACO

OpenRouter benchmarked Fusion on 100 DRACO deep-research tasks and published a tighter table than the social posts. The top result in the official post was Fable 5 plus GPT-5.5, synthesized by Opus 4.8, at 69.0%, versus 65.3% for solo Fable 5.

The launch's most shareable number was the budget panel:

Gemini 3 Flash + Kimi K2.6 + DeepSeek V4 Pro fused by Opus 4.8: 64.7%
Solo GPT-5.5: 60.0%
Solo Claude Opus 4.8: 58.8%
Solo DeepSeek V4 Pro: 60.3%
Solo Fable 5: 65.3%

The post also adds two caveats that got compressed in reposts. First, Fable 5 completed only 93 of the 100 tasks because its content filters blocked seven runs. Second, OpenRouter says DRACO does not cover long-horizon tasks, which is exactly where the company says Fable still looks strongest.

OpenRouter also published an unusually specific contamination note. After enabling web search, the models started finding the DRACO grading rubric online, so the company says it blocked those domains and re-ran the evals with excluded_domains and blocked_domains on its web tools.

Opus 4.8 judge billing

The clean benchmark story ran straight into a launch-day trust problem. In teortaxesTex's original post, API logs from a supposedly cheap open-model setup still showed calls to Opus 4.8 as judge, with no obvious way to disable it.

That complaint matched one line in the official product page that was easy to miss: the Fusion model page says requests are priced as the sum of the underlying completions, not as one model. It also tells users to inspect Activity to see which models actually ran.

Later the same day, teortaxesTex's follow-up said arbitrary model configurations were working over the API and the issue appeared to be a pure web failure. That narrows the problem, but it does not erase the launch-day confusion, because the scary behavior users saw was real enough to show up in logs and billing.

Community replies also converged on the same tradeoff from different angles. kimmonismus's reply said the cost advantage only exists when the panel leans on cheaper specialists, while mbusigin's critique argued the critic model has to be very strong, which pushes cost back up.

Control knobs

The docs expose more tunability than the launch thread made obvious.

analysis_models: choose 1 to 8 participant models.
model: set the judge separately from the panel.
max_tool_calls: cap each panel model and the judge between 1 and 16 tool steps.
max_completion_tokens, reasoning, temperature: forward inner-call settings to panel and judge runs.
tool_choice: "required": force Fusion on every request instead of letting the outer model decide.

Two implementation details stand out in the server-tool docs. Fusion blocks recursive self-invocation with an x-openrouter-fusion-depth header, and partial failures are first-class: some panel models can error while the tool still returns status: "ok" plus surviving responses.

That makes Fusion look less like a novelty router and more like a reusable orchestration primitive. The same docs explicitly position it for research, compare-and-contrast prompts, and expensive-to-be-wrong questions, not for every call.

Mixture of agents

The broader idea was already circulating under other names. Teknium, the creator behind Hermes Agent, said Hermes had previously tried Mixture of Agents as a tool and that models were bad at choosing when to use it, then linked an active Hermes /moa pull request that brings it back as a slash-command preset.

That PR shows a parallel design instinct. According to the GitHub pull request, /moa runs reference models and an aggregator before each main-model iteration, injects private guidance into the normal loop, and works across CLI, dashboard, desktop, and /goal flows.

Jerry Liu, founder of LlamaIndex, gave the sharper market read in Jerry Liu's post: the interesting claim is not just that ensembles can beat one model, but that third parties may be able to build better cost-accuracy curves from mixtures that frontier labs do not directly sell as a single product.

Launch FAQ

The 6/14 FAQ update added three concrete details that were mostly absent from early reaction posts.

Fusion is often 2 to 3 times slower when it is actually invoked, because it waits on multiple model runs and then processes the synthesis step.
OpenRouter says Fusion is not a drop-in coding replacement. Its suggested use is to let a coding model call Fusion selectively for architecture or best-practice research questions.
The benchmark kept tools constant across solo and panel runs: openrouter:web_search, openrouter:web_fetch, and openrouter:bash were available everywhere.

Those details make the launch easier to read. Fusion is a higher-cost, higher-latency orchestration layer with stronger knobs than the teaser posts suggested, and the company is still sorting out what that means across chat, API, and agent-tool surfaces.