Artificial Analysis launches AA-WER Streaming with Cartesia Ink-2 at 3.7% WER
Artificial Analysis launched AA-WER Streaming to benchmark streaming speech-to-text models on accuracy and latency for voice agents. The first leaderboard puts Cartesia Ink-2 and ElevenLabs Scribe v2 Realtime on the price versus final-accuracy frontier, so teams should weigh cost against both accuracy and latency before choosing a model.

TL;DR
- ArtificialAnlys' launch thread introduced AA-WER Streaming as a new benchmark for streaming speech-to-text, pairing word error rate with post-speech latency on voice-agent style workloads.
- According to ArtificialAnlys' results summary, Cartesia Ink-2 with semantic endpoints posted the best final-transcript accuracy at 3.59% WER and 0.21s latency, while ElevenLabs Scribe v2 Realtime was close behind at 3.64% WER and 0.14s.
- ArtificialAnlys' latency breakdown put Deepgram Flux on the speed extreme at 0.020s final latency and 0.019s partial latency, both at 7.36% WER, which makes the leaderboard more about tradeoffs than a single winner.
- On price-adjusted final accuracy, ArtificialAnlys' pricing post placed Cartesia Ink-2 external endpoints at $4 per 1,000 minutes with 3.7% WER and ElevenLabs Scribe v2 Realtime at $6.50 with 3.6% WER and 0.14s latency.
You can jump straight to [the full leaderboard](AA-WER Streaming leaderboard) and [the methodology page](AA-WER methodology). The benchmark splits results into first final transcript and first partial transcript, which is more useful than a single latency number for teams building turn-taking voice agents. The early leaderboard is already surfacing product-specific design choices: Cartesia's announcement and Cartesia's follow-up both lean on semantic endpointing and eager transcripts rather than raw WER alone.
Accuracy and latency
AA-WER Streaming measures two paired outcomes after end of speech is detected: the first final transcript and the first transcript-bearing event, which can be partial or final. That gives one view for high-confidence transcription and another for near-instant response paths, according to ArtificialAnlys' benchmark launch and [the methodology page](AA-WER methodology).
Artificial Analysis tested on roughly eight hours of audio, reusing the same AA-AgentTalk, Earnings22-Cleaned-AA, and VoxPopuli-Cleaned-AA sets from its offline AA-WER v2.0 benchmark, as ArtificialAnlys' benchmark launch explains.
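For readers less familiar with the metric, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration, not the benchmark's own scoring code: the text normalization (lowercasing and whitespace splitting) is an assumption, since this article does not describe the normalization Artificial Analysis applies, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Normalization here (lowercase, split on whitespace) is an assumption
    made for illustration; a real benchmark will normalize more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under these assumptions, a five-word reference transcribed with one dropped word and one wrong word scores 2 / 5, or 40% WER.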
The first leaderboard breaks into three rankings:
- Best final-transcript accuracy: Cartesia Ink-2 with semantic endpoints at 3.59% WER and 0.21s, then ElevenLabs Scribe v2 Realtime at 3.64% and 0.14s, then Cartesia Ink-2 with external endpoints at 3.66% and 0.09s, per ArtificialAnlys' results summary.
- Best first-partial accuracy: ElevenLabs Scribe v2 Realtime at 3.65% WER and 0.13s, then Cartesia Ink-2 external endpoints at 4.33% and 0.07s, then AssemblyAI U3 Realtime Pro at 4.46% and 0.47s, per ArtificialAnlys' partial-results summary.
- Fastest transcripts: Deepgram Flux at 0.020s final and 0.019s partial, both at 7.36% WER, with Soniox, Deepgram Nova-3 Realtime, and NVIDIA Nemotron 3 ASR 80ms following on latency, according to ArtificialAnlys' latency breakdown.
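One way to see why the rankings above do not produce a single winner is to check which models are dominated on both axes at once. The sketch below applies a plain Pareto filter to the four final-transcript results quoted in this article; the `Result` type and `pareto_frontier` helper are illustrative names, and the list is only a subset of the leaderboard.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    wer_pct: float    # lower is better
    latency_s: float  # lower is better


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep results that no other result beats on both WER and latency."""
    def dominates(a: Result, b: Result) -> bool:
        no_worse = a.wer_pct <= b.wer_pct and a.latency_s <= b.latency_s
        strictly_better = a.wer_pct < b.wer_pct or a.latency_s < b.latency_s
        return no_worse and strictly_better

    return [r for r in results if not any(dominates(o, r) for o in results)]


# Final-transcript figures as quoted in this article (subset of the board).
FINAL_TRANSCRIPT = [
    Result("Cartesia Ink-2 (semantic endpoints)", 3.59, 0.21),
    Result("ElevenLabs Scribe v2 Realtime", 3.64, 0.14),
    Result("Cartesia Ink-2 (external endpoints)", 3.66, 0.09),
    Result("Deepgram Flux", 7.36, 0.020),
]

frontier = pareto_frontier(FINAL_TRANSCRIPT)
```

All four entries survive the filter, because each step down in latency comes with a step up in WER. That is the tradeoff the leaderboard is built to expose.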
Pricing frontier
The pricing spread is wide enough to matter on its own. ArtificialAnlys' pricing comparison puts current streaming STT prices between $2 and $17 per 1,000 minutes.
Artificial Analysis highlighted two models on the price versus final-accuracy frontier, with Soniox marking the low-price end of the same comparison:
- Cartesia Ink-2, external endpoints: $4 per 1,000 minutes, 3.7% WER, per ArtificialAnlys' pricing comparison
- ElevenLabs Scribe v2 Realtime: $6.50 per 1,000 minutes, 3.6% WER and 0.14s latency, per ArtificialAnlys' pricing comparison
- Soniox: $2 per 1,000 minutes and 11.9% WER, which makes it the cheapest listed option but also the least accurate in that comparison, per ArtificialAnlys' pricing comparison
That framing is the useful part of the new board. The top-ranked model on any single accuracy or latency slice is not automatically the best buy once price enters the chart.
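To make the price comparison concrete, the sketch below turns the three quoted price points into a monthly bill and into the extra spend per WER point gained when moving up to a pricier model. The helper names are hypothetical, the WER values are the rounded figures from the pricing comparison, and any usage volume you plug in is your own assumption.

```python
# Price per 1,000 minutes (USD) and final WER (%) as quoted in this article.
PRICE_PER_1000_MIN = {
    "Cartesia Ink-2 (external endpoints)": 4.00,
    "ElevenLabs Scribe v2 Realtime": 6.50,
    "Soniox": 2.00,
}
WER_PCT = {
    "Cartesia Ink-2 (external endpoints)": 3.7,
    "ElevenLabs Scribe v2 Realtime": 3.6,
    "Soniox": 11.9,
}


def monthly_cost(model: str, minutes: float) -> float:
    """Dollars billed for `minutes` of streamed audio at the listed price."""
    return PRICE_PER_1000_MIN[model] * minutes / 1000


def extra_cost_per_wer_point(cheaper: str, pricier: str) -> float:
    """Extra dollars per 1,000 minutes paid for each WER point removed."""
    wer_gain = WER_PCT[cheaper] - WER_PCT[pricier]
    if wer_gain <= 0:
        raise ValueError("the pricier model must have the lower WER")
    price_gap = PRICE_PER_1000_MIN[pricier] - PRICE_PER_1000_MIN[cheaper]
    return price_gap / wer_gain
```

For a hypothetical 250,000 minutes a month, `monthly_cost("Soniox", 250_000)` evaluates to 500.0 dollars. Moving from Soniox to Ink-2 costs roughly $0.24 per 1,000 minutes for each WER point removed, while moving from Ink-2 to Scribe v2 Realtime costs about $25 per point on these rounded figures, though that second step also changes latency, which this ratio ignores.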
Methodology
The benchmark measures latency from the moment Silero VAD detects end of speech, not from the start of the utterance. ArtificialAnlys' methodology summary says WER and latency are then recorded at the first final-denoted transcript and at the first transcript-bearing event after that endpoint.
That makes the board unusually specific to live voice-agent turn taking. Artificial Analysis explicitly ties the partial-transcript view to fast yes-or-no style reactions and speculative decoding, while the final-transcript view is positioned as the better standalone accuracy measure, according to ArtificialAnlys' methodology summary.
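The two measurement points described above can be sketched as a small function over a stream of transcript events. This is an illustration under stated assumptions, not the benchmark harness: the `TranscriptEvent` schema is hypothetical, the endpoint timestamp is taken as given (in the benchmark it comes from Silero VAD), and treating only events at or after the endpoint as eligible is an assumption about how earlier partials are handled.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class TranscriptEvent:
    time_s: float   # arrival time, on the same clock as the VAD endpoint
    text: str       # transcript text carried by the event (may be empty)
    is_final: bool  # True when the provider marks the transcript as final


def post_speech_latencies(
    events: Iterable[TranscriptEvent], speech_end_s: float
) -> Tuple[Optional[float], Optional[float]]:
    """Return (first transcript-bearing latency, first final latency).

    Both are seconds after the detected end of speech; None means no
    matching event arrived after the endpoint.
    """
    # Assumption: only non-empty events at or after the endpoint count.
    after = sorted(
        (e for e in events if e.time_s >= speech_end_s and e.text.strip()),
        key=lambda e: e.time_s,
    )
    first_any = after[0].time_s - speech_end_s if after else None
    first_final = next(
        (e.time_s - speech_end_s for e in after if e.is_final), None
    )
    return first_any, first_final
```

In a real harness, WER would then be scored separately on the text of each of those two events, which is what produces the paired partial and final columns on the leaderboard.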
For readers who want the raw tables instead of the thread summary, the leaderboard and the methodology page are both public.
Semantic endpointing
Cartesia is already using the benchmark to argue for a specific streaming STT design stack. In Cartesia's design thread, the company says production-grade voice agents need three things together: strong accuracy on noisy or awkward inputs, low latency with eager transcripts, and semantic endpointing that helps the system avoid interrupting users.
Karan Goel's Ink-2 post makes the same pitch more bluntly, saying Ink-2 was built for streaming with fast eager mode and built-in semantic endpoints, and attributing its position on the Pareto frontier to new architectures and algorithms.
That is a distinct signal from the benchmark launch itself. The leaderboard measures post-speech speed and WER, but vendors are already treating endpointing behavior as part of the product story around voice-agent quality.