Model Routing
Choosing, composing, or switching models inside applications.
Stories
OpenRouter said four open-weight models now handle real agentic workloads, and a JPMorgan report put Chinese models at about 45% of platform traffic. The shift matters because teams are optimizing for price, hosting, and task fit instead of defaulting to frontier APIs.
Sakana made Fugu Ultra available through Vercel AI Gateway, while new technical writeups described the trained routing head and multi-step orchestration behind it. The integration matters because teams can invoke Fugu’s model-selection workflow through existing gateway plumbing instead of standing up custom routing.
OpenRouter released an MCP server that lets agents query live model pricing, benchmark scores, provider data, docs, and run test inference from the CLI. That replaces stale model knowledge with current routing data inside long-running agent workflows.
OpenRouter released a dedicated Image API that normalizes request shapes across 30-plus models from eight providers. Agents can inspect limits, passthrough options, streaming, and exact per-call cost without hardcoding vendor quirks.
Kilo Code added an Auto Efficient mode that routes each request to the cheapest model that clears its benchmark bar using public KiloBench results. The router stays session-aware and falls back to stronger paid models when confidence is low.
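The routing rule described above (cheapest model that clears a benchmark bar, with a fallback to a stronger model when confidence is low) can be sketched roughly as follows. This is a minimal illustration, not Kilo Code's implementation: the `Model` fields, the `confidence_floor` parameter, and the catalog entries are all hypothetical, and session awareness is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_mtok: float  # hypothetical price, USD per million tokens
    bench_score: float    # hypothetical benchmark score on a 0-100 scale


def pick_model(
    models: List[Model],
    min_score: float,
    confidence: float,
    confidence_floor: float = 0.6,  # assumed threshold, not from the source
    fallback: Optional[Model] = None,
) -> Model:
    """Return the cheapest model whose score clears min_score.

    If the router's confidence is below confidence_floor, or no model
    clears the bar, return the fallback (the highest-scoring model by default).
    """
    strongest = max(models, key=lambda m: m.bench_score)
    fallback = fallback or strongest
    if confidence < confidence_floor:
        return fallback
    eligible = [m for m in models if m.bench_score >= min_score]
    if not eligible:
        return fallback
    return min(eligible, key=lambda m: m.cost_per_mtok)


# Illustrative catalog; names, prices, and scores are made up.
CATALOG = [
    Model("small-open", 0.20, 61.0),
    Model("mid-open", 0.90, 74.0),
    Model("frontier-paid", 12.00, 88.0),
]
```

With this catalog, a bar of 70 and high confidence selects `mid-open`, while the same bar with low confidence escalates to `frontier-paid`.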
Sakana launched Fugu Ultra on AI Gateway and published a technical report, with early testers sharing mixed results. Reports mention polished outputs on some tasks, but also 30-minute runs, uneven coding quality, and much higher cost than GLM-5.2.
Morph said its code-serving stack now exposes Qwen, GLM-5.2, MiniMax M3, and DeepSeek v4 with code-tuned speculative decoding. It claims 20-35% higher acceptance rates than Eagle 3.1 or DFlash, plus kernels for cheaper hardware.
Codex workflows can now run against open-weight models served through compatible Responses API endpoints, with Ollama and vLLM publishing direct paths for GLM-5.2 and Kimi K2.7 Code. That matters because teams can keep the Codex interface while swapping to self-hosted or lower-cost inference backends.
OpenRouter launched Fusion, a server-side panel API that sends prompts to multiple models and combines their outputs into one answer. Early logs also showed a web-path issue where Fusion still invoked Claude Opus 4.8 as judge and billed for it until API-side control was clarified.
OpenRouter launched Fusion, a server-side panel API that fans prompts out to multiple models, judges the outputs, and returns one synthesized answer. The company said DRACO landed within 1% of Fable at roughly half the price, but the published evals do not cover long-horizon tasks.
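The fan-out, judge, and synthesize pattern described above can be sketched in a few lines. This is a generic illustration under stated assumptions, not Fusion's API: `fan_out`, `panel_answer`, the `top_k` parameter, and the stub models, judge, and synthesizer are all hypothetical stand-ins for real model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

ModelFn = Callable[[str], str]


def fan_out(prompt: str, panel: Dict[str, ModelFn]) -> Dict[str, str]:
    """Send the same prompt to every panel model concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(panel))) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in panel.items()}
        return {name: fut.result() for name, fut in futures.items()}


def panel_answer(
    prompt: str,
    panel: Dict[str, ModelFn],
    judge: Callable[[str, str], float],
    synthesize: Callable[[str, List[str]], str],
    top_k: int = 2,
) -> str:
    """Fan out, score each candidate with the judge, then merge the top_k."""
    candidates = fan_out(prompt, panel)
    ranked = sorted(
        candidates.items(),
        key=lambda item: judge(prompt, item[1]),
        reverse=True,
    )
    return synthesize(prompt, [text for _, text in ranked[:top_k]])


# Stubs standing in for real model, judge, and synthesizer calls.
PANEL = {
    "model-a": lambda p: "short",
    "model-b": lambda p: "a much longer answer",
    "model-c": lambda p: "medium answer",
}
length_judge = lambda prompt, answer: float(len(answer))
join_synth = lambda prompt, answers: " | ".join(answers)

result = panel_answer("demo prompt", PANEL, length_judge, join_synth)
```

Note that every panel member and the judge are separate billable calls in a real deployment, which is why the per-request cost of a panel is a multiple of a single-model call.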
One day after Fable 5 launched, users reported burning through Max quotas in about 90 minutes while Anthropic told subscribers the model will leave Claude plans on June 23 until capacity improves. If you depend on Fable, plan for quota pressure and route critical jobs elsewhere.
OpenRouter, OpenCode, Lovable, Cline, Browser Use Terminal, Nous Portal, and Venice all added Fable 5 within hours of launch. The rollouts put the model into gateways, coding agents, browser agents, and chat clients on day one.
Vendors pushed routing and spend controls closer to the default app layer, including OpenRouter's cache-hit pricing telemetry and Devin's adaptive routing. The discussion frames model choice more as a budget-control problem than a pure quality setting.
OpenRouter launched Pareto Code, a free experimental coding router that filters by min_coding_score and says it is already handling about 1 billion tokens a day. The release adds a tunable routing path for coding workloads where cost and model quality need to be balanced.
Perplexity said Computer will split tasks between on-device models and frontier cloud models, keeping some data on the local machine while escalating harder work remotely. That matters for privacy-sensitive workflows and for reducing token-heavy cloud usage on laptop-class hardware.
Factory put Router into private preview in its CLI and desktop app to route coding tasks across models, claiming 20-25% lower spend. The launch targets rising agent costs, though session continuity and routing behavior remain active points of debate.
Independent developers shipped sidecars that let Claude Code, Cursor, and Codex share memory, hot-swap model providers, package local projects as apps, and automate browser QA. Try these reusable tools if you want memory, routing, QA automation, and app packaging outside editor-specific features.
OpenRouter released Guardrails to apply budget limits, provider restrictions, zero-data-retention rules, prompt-injection defense, and DLP checks across routed traffic. Google Model Armor and Lakera Guard connectors are in beta, so plan around limited availability.
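As a rough picture of what a pre-routing policy check involves, the sketch below applies a budget limit, a provider allowlist, a zero-data-retention requirement, and a simple pattern-based DLP check before a request is forwarded. It is an assumption-heavy illustration, not OpenRouter's Guardrails API: the `Policy` fields, the reason strings, and the secret-matching regex are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass
class Policy:
    budget_usd: float
    allowed_providers: FrozenSet[str]
    require_zdr: bool = True   # require zero-data-retention providers
    spent_usd: float = 0.0


# Illustrative DLP rule: block strings that look like "sk-" style API keys.
SECRET_PATTERN = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")


def check_request(
    policy: Policy,
    provider: str,
    provider_zdr: bool,
    est_cost_usd: float,
    prompt: str,
) -> Tuple[bool, str]:
    """Return (allowed, reason) for a request before it is routed."""
    if provider not in policy.allowed_providers:
        return False, "provider_not_allowed"
    if policy.require_zdr and not provider_zdr:
        return False, "zdr_required"
    if policy.spent_usd + est_cost_usd > policy.budget_usd:
        return False, "budget_exceeded"
    if SECRET_PATTERN.search(prompt):
        return False, "dlp_match"
    return True, "ok"


# Hypothetical policy: 10 USD budget, 9.50 USD already spent.
POLICY = Policy(
    budget_usd=10.0,
    allowed_providers=frozenset({"provider-a", "provider-b"}),
    spent_usd=9.5,
)
```

Checks run from cheapest to most expensive, so a disallowed provider is rejected before the prompt text is scanned.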
Independent IDEs, gateways, and agent runtimes rolled out Claude Opus 4.8 within hours of launch, including Cursor, Warp, OpenRouter, and Perplexity. That matters because teams can benchmark or swap the model into existing workflows without waiting for connector lag.
Hermes Agent added a built-in MCP Catalog while separate builders shipped Qwen3.7 Max support, Venice private-model workflows, and Krea 2 image generation. The cluster shows Hermes moving beyond a single-model assistant toward a broader agent shell with tool, model, and media providers.
Ramp data and operator reports said enterprise AI token spending is rising far faster than budget controls and procurement cycles. Teams should plan for routing, cheaper defaults, and spend caps to become core engineering infrastructure.
OpenRouter announced a $113M Series B led by CapitalG and said weekly routed volume grew from 5T to 25T tokens in six months. The funding matters because the company is pitching itself as production infrastructure for multi-model deployments, not just an API convenience layer.
Warp now lets agents connect directly to an OpenRouter endpoint and switch providers through remembered model aliases. The change reduces endpoint setup friction for teams routing across hosted models inside Warp Agent.
Warp Agent now accepts user-supplied OpenAI, Anthropic, and Gemini keys plus OpenAI-compatible endpoints such as OpenRouter and DeepSeek. The change removes the paid-plan requirement for inference access and gives terminal users more routing options.
OpenCode, Kilo, Replicate, and Mastra exposed Gemini 3.5 Flash on launch day across coding agents, routers, and hosted APIs. The fast uptake gives engineers multiple harnesses to test Google's 1M-context model despite mixed first-party app reports.
OpenRouter replaced its old web plugin path with agentic web search and fetch tools that use a common schema across models. Migrate to the new tools if you need multi-search turns, domain filtering, or Parallel/Exa-native routing.
Nous Research added SuperGrok support to Hermes Agent, letting users plug a Grok subscription directly into the framework. It broadens Hermes beyond OpenAI runtimes and local setups into another mainstream agent model path.
OpenRouter updated BYOK workspaces so teams can attach multiple provider keys, scope them to specific models or users, and choose prioritized versus fallback use. It changes how rate-limit isolation, dev and prod separation, and failover routing are handled inside one workspace.
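One way to picture scoped keys with prioritized and fallback modes is a selector that filters keys by provider, model, and user, then orders them so prioritized keys are tried first. The sketch below is an interpretation, not OpenRouter's data model: the `ProviderKey` fields, the `mode` values, the `rate_limited` flag, and the key labels are all assumed for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class ProviderKey:
    label: str
    provider: str
    mode: str                          # "prioritized" or "fallback" (assumed values)
    models: Optional[Set[str]] = None  # None means the key applies to any model
    users: Optional[Set[str]] = None   # None means the key applies to any user
    rate_limited: bool = False


def candidate_keys(
    keys: List[ProviderKey], provider: str, model: str, user: str
) -> List[ProviderKey]:
    """Return usable keys in try-order: prioritized first, then fallback."""

    def in_scope(key: ProviderKey) -> bool:
        return (
            key.provider == provider
            and (key.models is None or model in key.models)
            and (key.users is None or user in key.users)
            and not key.rate_limited
        )

    scoped = [k for k in keys if in_scope(k)]
    # sorted() is stable, so keys keep their listed order within each mode.
    return sorted(scoped, key=lambda k: k.mode != "prioritized")


# Hypothetical workspace keys.
KEYS = [
    ProviderKey("shared-fallback", "openai", "fallback"),
    ProviderKey("prod-alice", "openai", "prioritized", models={"model-x"}, users={"alice"}),
    ProviderKey("dev-bob", "openai", "prioritized", users={"bob"}),
]
```

Scoping a key to one user or model keeps that traffic on its own rate limit, and the fallback key only sees requests when no prioritized key is in scope or available.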
Hermes Agent can now route core tool calls through the Codex app-server when it is using OpenAI models. The integration gives Hermes users access to Codex runtime behavior by running `hermes update`, without changing the rest of their agent stack.
Builders shipped pi-treebase, a Miko voice mode for pi-listens, devrage support, and a Japanese OpenCode Go guide after the first Pi extension burst. The releases arrive as Pi’s provider abstraction gets stress-tested by OpenClaw-scale multi-provider use.
OpenCode made Ring 2.6 1T available in the editor with reasoning enabled and free access for a limited period. Follow-on posts from Kilo and others claim frontier-level results on AIME 26, ClawEval, Gaia2-search, and Tau2-Bench Telecom.
Nous said Hermes Agent hit No. 1 among AI apps on OpenRouter after v0.13.0 shipped and added credential pools for rotating provider keys. Independent posts also tracked migrations from OpenClaw and early routing support in the same stack.
OpenRouter released Pareto Code, which routes requests to the cheapest coding model above a chosen score threshold and can re-rank for speed with Nitro. Use the API to trade cost against latency with benchmark-based routing controls.
OpenRouter added response caching across chat, responses, messages, and embeddings with per-key isolation, TTL controls, and cached stream replay. The beta matters because identical retries and test runs can return in milliseconds without provider charges or rate-limit hits.
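The behavior described above (identical requests served from cache, isolated per API key, expiring after a TTL) can be approximated with a small in-memory cache. This is a sketch under assumptions, not OpenRouter's implementation: the `ResponseCache` class, its hashing scheme, and the injectable `clock` are hypothetical, and stream replay is not modeled.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


class ResponseCache:
    """In-memory response cache keyed by (API key, canonical request body)."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def _key(api_key: str, request: Dict[str, Any]) -> str:
        # Canonical JSON makes key order irrelevant; including the API key
        # in the hash keeps one key's entries invisible to another.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(f"{api_key}\n{canonical}".encode("utf-8")).hexdigest()

    def get(self, api_key: str, request: Dict[str, Any]) -> Optional[Any]:
        key = self._key(api_key, request)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop the expired entry
            return None
        return response

    def put(self, api_key: str, request: Dict[str, Any], response: Any) -> None:
        self._store[self._key(api_key, request)] = (self._clock() + self._ttl, response)
```

A caller would check `get` before contacting the provider and call `put` after a successful response, so an identical retry within the TTL never reaches the provider.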
OpenClaw 2026.4.29 shipped a new group-chat flow, opt-in follow-up commitments, tighter exec controls, and first-class NVIDIA provider catalogs. The release matters because it pushes OpenClaw toward safer multi-user agent workflows instead of single-session chat hacks.
Provider and benchmark trackers listed Grok 4.3 with 1M context and lower token pricing, and OpenRouter and Venice exposed it through their APIs. The model undercuts Opus 4.7 and GPT-5.5 on price while independent evaluations show stronger legal and finance performance than general coding.
OpenClaw 2026.4.27 bundles DeepInfra support, better non-image attachments, explicit forward-proxy routing, and stricter model selection. The update broadens provider access while hardening operator-run deployments against routing and session failures.
Independent guides showed DeepSeek V4 running inside Claude Cowork and Claude Code via Anthropic-compatible endpoints, and Ollama added launch commands for Claude-style wrappers. The workflow matters because teams can keep Claude-centered agent UX while sharply lowering model spend, with provider compatibility and setup still the main caveats.
Hermes now pulls provider model lists from hosted JSON so new releases appear without client updates. The same update batch also auto-switches to a local browser when an agent needs localhost access.
Within a day of launch, vLLM, SGLang, Ollama cloud, OpenCode, Venice, Together, and Baseten added support or hosted access for DeepSeek V4. That makes Flash and Pro easier to test across local, routed, and managed agent stacks.
OpenRouter introduced Workspaces to separate API keys, BYOK, routing, plugins, and observability by environment or team. Billing stays unified at the account level while staging and production settings split cleanly.
A day after Kimi K2.6’s launch, providers and tools opened new access paths including temporary free use in Hermes and Cline plus availability on Replicate, Together, Perplexity, and Tinker. Engineers can test the open model across agent harnesses and hosted runtimes without standing up their own stack first.
GitHub added bring-your-own-model keys to Copilot in VS Code, letting users connect local or cloud providers instead of only bundled models. Teams can keep the Copilot harness while routing prompts through approved backends such as LM Studio or OpenRouter.
OpenRouter added Firecrawl as a search provider, letting models ground responses in scraped full web pages instead of snippet-only search. The launch folds crawling into the existing plugin settings flow and includes a capped free plan on the Firecrawl side.
Kimi K2.6 shipped across vLLM, SGLang, OpenRouter, Baseten, Ollama, OpenCode, Hermes Agent, and Droid within hours of launch. That cuts the usual lag between model release and production trials, so mixed-provider agent stacks can test it sooner.
Hermes Agent added Tool Gateway, bundling 300+ models with web, browser, image, terminal, and TTS tools behind one subscription. Firecrawl, Browser Use, Fal image models, and Gemini Voice shipped at launch.
Anthropic added a beta advisor tool to the Messages API so Sonnet or Haiku can call Opus mid-run inside one request. Anthropic says Sonnet plus Opus scored 2.7 points higher on SWE-bench Multilingual while cutting per-task cost 11.9%.
Hermes Agent now treats Hugging Face as a first-class inference provider and surfaces 28 curated models in its picker, plus a custom path to the broader catalog. That broadens model choice for a persistent local agent workflow without requiring users to wire a provider manually.