Fugu Ultra testers report 30-minute runs and 17x GLM cost after launch
Sakana launched Fugu Ultra on AI Gateway and published a technical report, with early testers sharing mixed results. Reports mention polished outputs on some tasks, but also 30-minute runs, uneven coding quality, and much higher cost than GLM-5.2.

TL;DR
- Sakana launched SakanaAILabs' launch post and described Fugu Ultra in SakanaAILabs' how-it-works thread as a single OpenAI-compatible endpoint that internally selects, delegates, verifies, and synthesizes across a pool of models.
- The benchmark sheet in hardmaru's technical report post and SakanaAILabs' use-case thread shows Fugu Ultra winning or tying several reasoning and agentic tests, but it also trails Fable 5 on SWEbench Pro and Humanity's Last Exam.
- Early hands-on reports split fast: emollick's 30-minute run report said typical coding tests took 30 minutes and did not match Fable in real use, while rohanpaul_ai's Atomic test summary said Fugu produced the most polished trading-desk UI in a live coding benchmark.
- Cost is already part of the story, because ai_for_success's Atomic comparison and rohanpaul_ai's Atomic test summary both said GLM 5.2 came close on quality at roughly 17 times lower cost.
- Sakana's pitch around sovereignty and export controls in SakanaAILabs' launch blog thread ran straight into criticism from eliebakouch's tech-report critique and BlancheMinerva's reply, both of whom argued that a closed router over third-party cloud models does not remove vendor dependency.
You can read the technical report, scan the release notes, and check the AI Gateway listing. Sakana also posted a long use-case thread covering autonomous ML research, blindfold chess, CAD, and a Rubik's Cube solver, while kimmonismus' region-block screenshot showed the launch was geo-blocked in the EU and EEA.
Single API, many hidden calls
Fugu's core move is not a new base model reveal. In SakanaAILabs' how-it-works thread, Sakana says Fugu is itself an LLM trained to call models in an agent pool, including instances of itself recursively.
The launch materials break that orchestration into four jobs:
- model selection
- delegation
- verification
- synthesis
Sakana split the product into two SKUs in the same thread: Fugu for lower-latency everyday work, and Fugu Ultra for deeper multi-step tasks like research, cybersecurity analysis, and patent investigations. Vercel's vercel_dev launch post confirmed Fugu Ultra was exposed as a single model slug, sakana/fugu-ultra, on AI Gateway.
Benchmarks are strong, not clean
The benchmark grid in the technical report post is the cleanest snapshot of what moved. Fugu Ultra posts strong wins on LiveCodeBench, GPQA-D, CharXiv Reasoning, and Terminal Bench 2.1, while the same chart also shows obvious misses against frontier baselines on some coding and knowledge-heavy tests.
A few numbers matter most:
- LiveCodeBench: Fugu Ultra 93.2, ahead of Fable 5 at 89.8.
- GPQA-D: Fugu Ultra 95.5, ahead of Mythos Preview at 94.6.
- Terminal Bench 2.1: Fugu Ultra 82.1, just above Fable 5 at 80.4.
- SWEbench Pro: Fugu Ultra 73.7, behind Fable 5 at 80.0.
- Humanity's Last Exam, text: Fugu Ultra 50.0, behind Fable 5 at 53.3.
- CTI-REALM: Fugu Ultra 69.4, slightly behind Opus 4.8 at 69.6.
That shape matters more than the headline. Even Sakana's own chart in SakanaAILabs' use-case thread reads like a system that spikes on orchestration-friendly tasks rather than sweeping every benchmark on the board.
Early testers found polish, latency, and jaggedness
The first outside usage reports are not subtle. emollick's 30-minute run report said typical coding tests for shaders and interactive scenes took 30 minutes and produced results that were "fine," not Fable-level.
rohanpaul_ai's Atomic test summary and ai_for_success's Atomic comparison land on a different angle. In a live trading-desk coding task, Fugu Ultra produced the richest interface with multiple panels, charts, watchlists, and a more finished feel, even though GLM 5.2 came close overall.
The user pattern so far looks split by task:
- HamelHusain's plugin experiment said the model felt jagged, strong for code review but weaker for frontend work.
- emollick's 30-minute run report described long waits and underwhelming real-use performance.
- rohanpaul_ai's Atomic test summary credited Fugu with the best visual polish in a complex app build.
Cost and transparency became day-one objections
The loudest technical criticism is not that orchestration exists. It is that Sakana did not disclose enough about cost, token usage, or the mix of models behind the scorecard.
In eliebakouch's tech-report critique, eliebakouch argued that Fugu is effectively a closed orchestrator over closed models, that the report does not say how much test-time compute was spent per benchmark, and that the unnamed "Model A, B and C" setup in some use cases muddies comparison. eliebakouch's pricing question explicitly asked for per-benchmark model mix, output tokens, and task cost.
The live benchmark chatter reinforced the cost problem. According to ai_for_success's Atomic comparison, GLM 5.2 got surprisingly close to Fugu Ultra in a trading-desk build while costing roughly 17 times less. daniel_mac8's reply captured the obvious follow-up: paying more than a frontier flagship to match a frontier flagship is a hard sell unless the orchestration layer adds something concrete.
Availability started with AI Gateway and an EU block
Sakana did not just ship a paper. vercel_dev launch post announced Fugu Ultra live on AI Gateway, which made the launch immediately testable through an existing OpenAI-compatible surface.
Availability was narrower than the splashy benchmark thread suggested. testingcatalog's launch summary said the API was not accessible in the EEA, and kimmonismus' region-block screenshot showed the actual error message: "Not yet available in the EU/EEA while we work toward compliance with GDPR and EU-specific regulations." That restriction became part of Sakana's day-one discourse because the company's sovereignty pitch in hardmaru's vision thread landed at the same moment some European users could not access the product at all.