
OpenAI adds GPT-Realtime-2, Translate, and Whisper to the Realtime API

OpenAI added GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper to the Realtime API. The update gives voice agents live reasoning, translation, and transcription, but it remains API-only rather than part of ChatGPT voice mode.


TL;DR

  • OpenAI added three new audio models to the Realtime API, with OpenAI's launch thread naming GPT-Realtime-2 for speech-to-speech agents, GPT-Realtime-Translate for live translation, and GPT-Realtime-Whisper for streaming transcription.
  • According to OpenAIDevs' model overview, GPT-Realtime-2 upgrades the context window from 32K to 128K and improves tool use, recovery behavior, domain language handling, and tone control during live conversations.
  • Benchmarks in OpenAIDevs' charts put GPT-Realtime-2 at 96.6% on Big Bench Audio and 48.5% on Audio MultiChallenge instruction following, while Artificial Analysis said the minimal reasoning setting also led its Conversational Dynamics benchmark at 96.1%.
  • The ship is API-only for now: OpenAI's reply teased voice updates for ChatGPT later, and Simon Willison noted that the current ChatGPT voice experience did not get the new model on launch day.
  • Early hands-on posts from petergostev's live translator demo, juberti's hello-realtime update, and kwindla's agent test focused less on chatbot voice and more on tool-using agents, live dubbing, and longer-running voice workflows.

You can jump from the official announcement to the Realtime API guide, then straight into OpenAI's new prompting guide for GPT-Realtime-2. The most useful non-launch detail came from Artificial Analysis' model breakdown, which added latency and pricing context, while petergostev's GitHub demo and hello-realtime.val.run showed what people started building within hours.

GPT-Realtime-2

OpenAI is positioning GPT-Realtime-2 as the first realtime voice model in its stack with reasoning, not just turn-by-turn speech synthesis. In the official post, the company says the model is built for agents that can keep talking while they use tools and work through harder requests.

The feature list in OpenAIDevs' thread is the important part for engineers:

  • better performance on harder requests
  • more reliable tool use
  • stronger recovery behavior after interruptions
  • improved handling of domain-specific language
  • better tone control while the conversation is still live
  • 128K context, up from 32K

A model card screenshot in bridgemindai's Playground capture adds a few specs that did not make the top-line tweet: text, audio, and image input; text and audio output; 32K max output tokens; reasoning token support; and a September 30, 2024 knowledge cutoff.
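
For a sense of what wiring that up looks like, here is a minimal session bootstrap over the Realtime API's WebSocket surface. The model slug, the tool definition, and the session fields are assumptions based on the existing Realtime API and the launch naming, not taken from the new docs:

```python
# Minimal GPT-Realtime-2 session bootstrap. Assumptions: the model slug
# "gpt-realtime-2" follows the launch naming, and session.update carries the
# same fields as the existing Realtime API; lookup_order is a made-up tool.
import asyncio
import json
import os

import websockets  # pip install "websockets>=14"

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}


async def main() -> None:
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Configure the session: instructions, one illustrative tool, and
        # server-side voice activity detection so the model can be interrupted.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a concise phone support agent.",
                "turn_detection": {"type": "server_vad"},
                "tools": [{
                    "type": "function",
                    "name": "lookup_order",  # hypothetical tool
                    "description": "Fetch an order by its ID.",
                    "parameters": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    },
                }],
            },
        }))
        # From here the app streams microphone audio in with
        # input_audio_buffer.append events and plays audio deltas back out.
        print(json.loads(await ws.recv()).get("type"))


asyncio.run(main())
```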

GPT-Realtime-Translate

GPT-Realtime-Translate is the cleanest new capability in the drop. OpenAI's announcement described it as streaming translation for more than 70 input languages and 13 output languages, and OpenAIDevs' follow-up framed it around live multilingual conversations rather than turn-based translation.

The first wave of demos immediately went beyond voice chat. petergostev's demo wired the model into a Chrome extension that overlays translated speech on YouTube videos, and the repo is public at Live-YT-Translator. RayFernando1337 streamed a separate live-translator build, an early sign that the API surface is straightforward enough for same-day hacking.

One interesting loose detail came from petergostev's follow-up, which said an OpenAI event demo described training on UN simultaneous interpreters. That claim is not in the official launch material, but it does help explain why the product emphasis is on staying in sync with a speaker instead of waiting for clean sentence boundaries.
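
None of the demos published their session configs, so the following is only a sketch: it reuses the WebSocket scaffold from the GPT-Realtime-2 sketch above and assumes the Translate model rides the same session.update surface, with the language pairing expressed through instructions rather than a dedicated field.

```python
# Hypothetical translation session payload, reusing the connection scaffold
# from the GPT-Realtime-2 sketch. The model slug and the use of instructions
# to pin the target language are both assumptions.
TRANSLATE_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-translate"

translate_session = {
    "type": "session.update",
    "session": {
        # Keep the output in sync with the speaker instead of waiting for
        # clean sentence boundaries, per the product emphasis above.
        "instructions": (
            "Translate the incoming speech into Spanish as it arrives. "
            "Do not answer questions or add commentary; only translate."
        ),
        "turn_detection": {"type": "server_vad"},
    },
}
```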

GPT-Realtime-Whisper

GPT-Realtime-Whisper is less flashy, but probably the most immediately reusable model in the set. OpenAIDevs' post pitched it as low-latency streaming transcription for apps that need to understand speech continuously while the interaction is still unfolding.

The demo linked from juberti's post at realtyper.val.run shows the intended use case in one screen: continuous transcription with no handoff to a separate batch ASR step. Later, juberti's delay-selector update added an explicit latency versus accuracy control, which is a useful tell about the product tradeoff OpenAI is exposing across the stack.
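
A plausible shape for that, assuming GPT-Realtime-Whisper slots into the Realtime API's existing transcription-session surface (the event names below come from that surface; the model slug is our guess):

```python
# Hypothetical streaming-transcription config. transcription_session.update
# and the transcription delta events exist in today's Realtime API; plugging
# a "gpt-realtime-whisper" slug into them is an assumption from the naming.
transcription_session = {
    "type": "transcription_session.update",
    "session": {
        "input_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "gpt-realtime-whisper",  # assumed slug
        },
        "turn_detection": {"type": "server_vad"},
    },
}
# The server then emits conversation.item.input_audio_transcription.delta
# events continuously, so there is no handoff to a batch ASR step.
```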

Benchmarks and latency

OpenAI's own charts in OpenAIDevs' benchmark post compare GPT-Realtime-2 with GPT-Realtime-1.5 on two metrics:

  • Big Bench Audio intelligence: 81.4% → 96.6%, up 15.2 points
  • Audio MultiChallenge instruction following: 34.7% → 48.5%, up 13.8 points

Artificial Analysis added the broader market picture in its benchmark thread. The high reasoning version matched Gemini 3.1 Flash Live Preview High at 96.6% on Big Bench Audio, while the minimal reasoning version led its Conversational Dynamics benchmark at 96.1%.

Pricing stayed flat, according to Artificial Analysis' pricing post, at $1.15 per hour of input audio and $4.61 per hour of output audio. That same thread also pegged time to first audio at 1.12 seconds for minimal reasoning and 2.33 seconds for high reasoning, which gives a more concrete picture of what OpenAI means by configurable reasoning effort.

A more interesting wrinkle came from kwindla's agent benchmark note, which said the "low" reasoning setting beat higher effort levels on a complex multi-turn agent benchmark. For voice agents, the fastest usable setting may end up being the best setting more often than the flagship chart implies.
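
If that pattern holds, reasoning effort becomes a knob you set per use case rather than a quality dial you max out. A sketch of that decision follows, with the caveat that the launch material only says effort is configurable; the session field below mirrors the effort parameter from OpenAI's text APIs rather than documented Realtime API syntax.

```python
# Picking reasoning effort from a latency budget. The timings are Artificial
# Analysis' measurements quoted above; the "reasoning" session field is an
# assumption modeled on the effort parameter in OpenAI's text APIs.
TIME_TO_FIRST_AUDIO_S = {"minimal": 1.12, "high": 2.33}

def pick_effort(budget_s: float) -> str:
    # Prefer the cheapest setting that fits the budget; kwindla's agent
    # benchmark suggests lower effort can even win on quality for agents.
    return "minimal" if budget_s < TIME_TO_FIRST_AUDIO_S["high"] else "high"

session_update = {
    "type": "session.update",
    "session": {"reasoning": {"effort": pick_effort(1.5)}},  # hypothetical field
}
```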

API-only, not ChatGPT voice

The loudest confusion on launch day was distribution, not capability. OpenAI's reply said voice updates for ChatGPT are coming, but the launch itself was limited to the API and Playground.

That distinction mattered because a lot of people read the announcement as an Advanced Voice Mode refresh. Simon Willison immediately pointed out that the current ChatGPT voice experience had not changed yet, while Sam Altman separately said OpenAI is "working on improvements to voice in chat." The result is a familiar OpenAI pattern: the developer surface gets the real ship first, and the consumer surface gets the teaser.

Prompting guide

The most practically useful post-launch artifact may be the new GPT-Realtime-2 prompting guide. OpenAI says it covers six specific areas:

  • tuning reasoning effort
  • using short preambles before actions
  • designing tool behavior
  • handling unclear audio
  • capturing exact entities
  • maintaining state across longer sessions

Those topics line up almost perfectly with what early builders were already stress-testing. kwindla's agent write-up used the model with 25 tools, an 8K-token system prompt, context-sharing with subagents, and async context compression. Genspark's rollout note claimed a 26% gain in effective conversation rate and fewer dropped calls after putting GPT-Realtime-2 into its call agent. OpenAI shipped three audio models, but the docs make the real story explicit: this release is about voice agents that have to stay coherent while they call tools and keep talking.
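
The guide's themes are easy to mirror in a system prompt. The fragment below is our own illustrative wording covering three of them (short preambles, exact entity capture, unclear audio), not text from OpenAI's guide:

```python
# Illustrative instructions fragment; wording is ours, not OpenAI's.
INSTRUCTIONS = """\
Before calling a tool, say one short sentence about what you are doing
("Let me pull up that order") and then call it.

When the user gives an identifier such as an order number, email address,
or name, repeat it back verbatim and confirm it before acting on it.

If the audio is unclear or partially inaudible, say so and ask the user
to repeat themselves; never guess at what was said.
"""
```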
