Skip to content
AI Primer
breaking

The Information reports OpenAI cuts inference costs by more than 50% on some models

Multiple summaries of The Information report said OpenAI found inference optimizations that more than halved costs on some existing models. If that holds, it changes the margin, pricing, and usage-limit math behind ChatGPT and API serving even before new model releases arrive.

4 min read
The Information reports OpenAI cuts inference costs by more than 50% on some models
The Information reports OpenAI cuts inference costs by more than 50% on some models

TL;DR

You can read the leaked screenshot of The Information's report, compare it with a second summary that adds margin numbers, and see a useful caveat on premium low-latency serving. The Rundown AI also tied the report to Dario Amodei's "compute multipliers" remark, which is a good label for the kind of hidden serving gain competitors hate to hand away.

The optimization claim

The core claim is narrow and concrete. OpenAI reportedly found newly discovered inference optimizations that more than halved the cost of running existing models, not a next-model jump and not a training breakthrough.

That wording matters because it points at serving stack work. rohanpaul_ai framed the obvious suspects as quantization, KV-cache changes, batching, speculative decoding, and cheaper-model routing, but no source in the evidence pool identifies which of those actually shipped.

A couple hundred GPUs for logged-out ChatGPT

The most surprising number in the report is the serving footprint for anonymous ChatGPT traffic. The screenshot says OpenAI applied the technique to visitors without free or paid accounts and got the required GPU count down to "just a couple hundred" Nvidia GPUs at one point.

That number comes with an important limiter in the same screenshot: logged-out users likely do not account for much usage because OpenAI restricts how much they can do that way. The point is not that all of ChatGPT now runs on a tiny cluster. The point is that OpenAI apparently found a way to make one low-priority serving tier dramatically cheaper.

What people think changed

No one in the evidence pool has the mechanism. The discussion converges on a short list of plausible inference-side levers:

  • Quantization
  • KV-cache improvements
  • Better batching
  • Speculative decoding
  • Routing simpler requests to cheaper models or cheaper modes

rohanpaul_ai lists that full set explicitly, while The Rundown AI connects the story to Anthropic CEO Dario Amodei's internal term, "compute multipliers," for closely held efficiency gains. That is Christmas-come-early language for infra teams because it implies a real serving advantage can arrive without any visible model release.

Cost cuts do not automatically show up as cheaper speed

Lower backend cost and premium low-latency pricing are different layers. In a reply, kimmonismus noted that OpenAI charges up to 2.5x for Priority, while Anthropic Fast Mode can reach 6x, so a lab can get materially cheaper per-query infrastructure and still keep fast-path access as a premium SKU.

That caveat keeps the story grounded. A big inference win can improve margin, usage limits, and capacity planning without immediately turning into lower sticker prices for developers.

The margin math behind the report

The clearest non-technical read in the evidence is that efficiency is becoming part of the product moat. kimmonismus's main post said OpenAI ended Q1 with a 39% gross margin and is targeting 52% by year-end, while rohanpaul_ai added that adjusted gross margin had fallen to 33% in 2025 after inference costs quadrupled.

That context explains why several reactions immediately jumped from systems internals to competition. kimmonismus's reply argued OpenAI's moat is increasingly cost and inference, and daniel_mac8 made the broader point that the smartest model does not automatically win if another lab can serve demand much more efficiently.

Share on X