OpenRouter launches Response Caching with X-OpenRouter-Cache and 80-300 ms hits
OpenRouter added response caching across chat, responses, messages, and embeddings with per-key isolation, TTL controls, and cached stream replay. The beta matters because identical retries and test runs can return in milliseconds without provider charges or rate-limit hits.

TL;DR
- OpenRouter's launch post says Response Caching is now in beta, and identical requests can be replayed with `X-OpenRouter-Cache: true` at zero token cost after the first billed call.
- According to OpenRouter's latency comparison, cache hits return in 80 to 300 ms, while uncached calls in its examples took about 1.3 s for Gemini 2.5 Flash, 4.6 s for Kimi K2.6, and 9.1 s for GPT-5.5.
- OpenRouter's endpoint list and the official docs both say the feature covers `/chat/completions`, `/responses`, `/messages`, and `/embeddings`, including cached stream replay and multimodal inputs.
- In OpenRouter's prompt-caching note, the company draws a clean line between provider-side prompt caching and its own response cache, which sits in front of the provider and can bypass provider billing and rate limits on a hit.
- The documentation adds a few beta gotchas that are easy to miss in the launch thread: only `200 OK` responses are cached, concurrent identical requests are not coalesced, and JSON property order can change the cache key.
You can read the announcement, inspect the beta docs, and even spot the cache indicator in OpenRouter's screenshot. The buried details are the interesting part: streamed hits get replayed through the same pipeline, tool-call responses are cacheable, and two identical requests fired at the same time can still both bill as misses because there is no request coalescing.
What shipped
OpenRouter shipped a response cache in front of model providers, not inside any one model backend. The announcement says the cache key includes the request body, model, API key, and streaming mode, so only identical calls hit.
The enablement surface is small:
- Add `X-OpenRouter-Cache: true` per request, per OpenRouter's header example.
- Set `cache_enabled: true` on a preset, according to OpenRouter's controls post and the docs.
- Override lifetime with `X-OpenRouter-Cache-TTL`, from 1 second to 24 hours, per OpenRouter's TTL note.
- Bust a single entry with `X-OpenRouter-Cache-Clear: true`, according to OpenRouter's cache clear note.
- Inspect `HIT`, `MISS`, `Age`, and `TTL` through response headers, as OpenRouter's controls post describes.
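
A minimal sketch of the request-level flow in Python, assuming the standard OpenRouter chat-completions endpoint and an illustrative model slug. The request headers (`X-OpenRouter-Cache`, `X-OpenRouter-Cache-TTL`) come from the launch post; the exact names of the response headers carrying `HIT`/`MISS`, `Age`, and `TTL` are not quoted above, so the status-reading loop just scans for anything cache-related rather than hardcoding a header name:

```python
import os
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["OPENROUTER_API_KEY"]

body = {
    "model": "google/gemini-2.5-flash",  # assumed slug for Gemini 2.5 Flash
    "messages": [{"role": "user", "content": "Summarize RFC 9110 in one line."}],
}

headers = {
    "Authorization": f"Bearer {API_KEY}",
    # Opt this request into OpenRouter's response cache (per the launch post).
    "X-OpenRouter-Cache": "true",
    # Optional TTL override, 1 second to 24 hours per the TTL note; 1 hour here.
    "X-OpenRouter-Cache-TTL": "3600",
}

resp = requests.post(API_URL, json=body, headers=headers, timeout=60)
resp.raise_for_status()

# The docs describe HIT/MISS, Age, and TTL response headers, but the exact
# header names are not spelled out above, so surface anything cache-related.
for name, value in resp.headers.items():
    if "cache" in name.lower() or name.lower() == "age":
        print(f"{name}: {value}")

print(resp.json()["choices"][0]["message"]["content"])
```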
Latency and cost profile
The headline claim is simple: once a response is in cache, OpenRouter says the retry path drops from provider latency to edge-cache latency. In OpenRouter's latency comparison, the company pegs the cache lookup itself at 4 ms and the end-to-end hit path at 80 to 300 ms.
That changes the economics of repeated exact calls more than the mechanics of first calls. OpenRouter's use-case list breaks it into three buckets:
- Agent retries, where rerunning the same earlier steps only bills the new work.
- Test suites, where the first successful run seeds the cache and later runs are deterministic and free.
- Repeated context calls, where the same prompt, model, and params only pay once.
The announcement also says cache hits do not count against provider rate limits because the request never reaches the provider.
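
To see the retry economics concretely, here is a hypothetical timing harness: it fires the same request twice with caching enabled and compares wall-clock latency. Under the behavior described above, the second call should land in the 80 to 300 ms range and, per the announcement, bill zero tokens; the endpoint URL and model slug are assumptions, and the calls are sequential on purpose, since concurrent twins would not be coalesced:

```python
import os
import time
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
    "X-OpenRouter-Cache": "true",  # opt into response caching
}
# Byte-identical body both times, so both calls map to the same cache key.
BODY = {
    "model": "google/gemini-2.5-flash",  # assumed slug for Gemini 2.5 Flash
    "messages": [{"role": "user", "content": "ping"}],
}

def timed_call() -> float:
    start = time.perf_counter()
    resp = requests.post(API_URL, json=BODY, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return time.perf_counter() - start

first = timed_call()   # expected miss: full provider latency, billed
second = timed_call()  # expected hit: edge-cache latency, zero token cost
print(f"first (miss?): {first * 1000:.0f} ms, retry (hit?): {second * 1000:.0f} ms")
```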
Prompt caching versus response caching
OpenRouter is explicit that this is a different layer from prompt caching. Per its note on the subject, prompt caching discounts shared prefixes inside a request, while response caching returns the full completed response from OpenRouter without touching the provider.
That distinction matters because the two caches stack. The docs say provider caching still operates inside the vendor's infrastructure, while OpenRouter caching happens before the provider call.
Cache-key mechanics
The docs hide the most practical implementation detail: "identical" is stricter than most teams will assume. The response-caching guide says the key includes the API key, model, endpoint type, streaming mode, and a SHA-256 hash of the normalized request body.
A few consequences fall straight out of that design:
- Different API keys never share cache, even under the same account, as OpenRouter's per-key isolation note also states.
- `stream: true` and non-streaming calls are cached separately, per the docs.
- Omitting a default field and explicitly sending it can produce different keys, according to the docs.
- JSON property order is significant, so logically identical bodies can still miss if the serializer reorders fields, per the docs.
- Attribution headers such as `HTTP-Referer` and `X-Title` are not part of the key, according to the docs.
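
Since property order changes the key, a client that wants stable hits should serialize request bodies deterministically before sending. Here is a minimal sketch of the idea; the `sort_keys` normalization is a client-side defensive habit, not something the docs prescribe, and OpenRouter's own hashing details beyond "SHA-256 of the normalized body" are not public:

```python
import hashlib
import json

def canonical_body(body: dict) -> str:
    # Sort keys and pin separators so logically identical dicts always
    # serialize to the same bytes, regardless of insertion order.
    return json.dumps(body, sort_keys=True, separators=(",", ":"))

a = {"model": "google/gemini-2.5-flash", "messages": [{"role": "user", "content": "hi"}]}
b = {"messages": [{"role": "user", "content": "hi"}], "model": "google/gemini-2.5-flash"}

# Without normalization these two bodies serialize differently and, per the
# docs, would hash to different cache keys despite being logically identical.
print(json.dumps(a) == json.dumps(b))          # False: key order differs
print(canonical_body(a) == canonical_body(b))  # True: canonical form matches
print(hashlib.sha256(canonical_body(a).encode()).hexdigest()[:16])
```

Sending the canonical string yourself (`data=canonical_body(body)` with a `Content-Type: application/json` header, instead of `json=body`) keeps every caller in a codebase on the same byte sequence.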
Beta caveats
The beta label is doing some real work here. The docs say only successful 200 OK responses are cached, while errors, rate-limit responses, and partial outputs are never stored.
A few more caveats are easy to miss:
- Concurrent identical requests are not coalesced, so two requests that arrive before the first write completes can both miss and bill separately, per the docs.
- Very large multimodal payloads that get offloaded internally are not eligible for caching, according to the announcement.
- Cached streaming responses replay the same content chunks, but the `id`, `created` timestamp, and `X-Generation-Id` reflect a new cache-hit generation record, per the docs.
- Rotating an API key starts with an empty cache, because cache scope is tied to the key, according to the docs.
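
Because OpenRouter does not coalesce concurrent identical requests, a client that might fire duplicates in parallel can dedupe in-flight calls itself. A rough sketch of single-process coalescing, keyed on something like the canonical body from the earlier snippet; all names here are illustrative, not part of any OpenRouter SDK:

```python
import threading
from concurrent.futures import Future

_inflight: dict[str, Future] = {}
_lock = threading.Lock()

def coalesced_call(key: str, do_request):
    """Run do_request() once per key; concurrent callers share the result.

    Without this, two identical requests racing the first cache write can
    both reach the provider and both bill as misses, per the docs.
    """
    with _lock:
        fut = _inflight.get(key)
        owner = fut is None
        if owner:
            fut = Future()
            _inflight[key] = fut
    if not owner:
        return fut.result()  # wait on the in-flight twin instead of re-sending
    try:
        result = do_request()
        fut.set_result(result)
        return result
    except Exception as exc:
        fut.set_exception(exc)
        raise
    finally:
        with _lock:
            _inflight.pop(key, None)
```

This only helps within one process; duplicate requests from separate workers would still race each other until the first successful response lands in OpenRouter's cache.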