Artificial Analysis launches AA-AgentPerf for 200-turn, 100K-token coding traces
Artificial Analysis introduced AA-AgentPerf to benchmark hardware on real coding-agent traces instead of synthetic chat prompts. The benchmark reports concurrent users per accelerator, per kW, per dollar, and per rack, so teams can compare production cost and throughput more realistically.

TL;DR
- Artificial Analysis launched AA-AgentPerf as a new hardware benchmark built around real coding-agent traces rather than short synthetic prompts; its launch post says the captured workloads run for "up to 200 turns" and exceed "100K tokens" in context length.
- The benchmark allows deployment-side optimizations that matter in production, including "KV cache reuse," disaggregated prefill/decode, and speculative decoding, according to the launch post.
- AA-AgentPerf reports the maximum number of concurrent users at target output speeds, then normalizes those results per accelerator, per kW TDP, per dollar per hour, and per rack, as Artificial Analysis explains in the results thread.
- Submissions are open now, launch support is limited to gpt-oss-120b and DeepSeek V3.2, and Artificial Analysis says in the follow-up post that first results should land in 1-2 weeks after provider submissions and QA.
What exactly is AA-AgentPerf measuring?
AA-AgentPerf is trying to benchmark inference hardware against agent-style coding sessions instead of single-turn chat loads. Artificial Analysis says the test set uses real trajectories with long-running interactions, large contexts, and production serving tricks turned on, which makes the benchmark closer to how coding agents are actually deployed than classic prompt-throughput charts, per the launch post.
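
To make one of those "serving tricks" concrete: KV cache reuse matters for agent traces because each turn replays an ever-growing shared prefix. The sketch below is illustrative only; Artificial Analysis does not name a serving stack, so vLLM, the model ID, the parallelism setting, and the prompt contents are all assumptions, and disaggregated prefill/decode and speculative decoding (also permitted by the benchmark) are separate deployment-level configurations omitted here.

```python
# Illustrative sketch, not AA-AgentPerf's harness. Assumes vLLM as the serving
# engine and shows "KV cache reuse" via prefix caching, one of the
# deployment-side optimizations the launch post says the benchmark allows.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",    # one of the two launch models
    tensor_parallel_size=8,         # placeholder: depends on the accelerator
    enable_prefix_caching=True,     # reuse KV cache for shared conversation prefixes
)

# An agent trace replays the same growing prefix every turn; with prefix
# caching enabled, only the new turn's tokens need prefill.
history = "SYSTEM: You are a coding agent.\nUSER: Fix the failing test.\n"
for turn in range(3):  # real traces run up to 200 turns, per the launch post
    out = llm.generate(history, SamplingParams(max_tokens=256, temperature=0))
    reply = out[0].outputs[0].text
    history += f"ASSISTANT: {reply}\nUSER: (next tool result or instruction)\n"
```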
The key metric is not just raw tokens per second. Artificial Analysis's follow-up post says results will be expressed as the maximum number of concurrent users a system can sustain at a given per-user output speed, then broken out as users per accelerator, per kW TDP, per rental dollar, and per rack. That framing makes the benchmark more directly useful for teams choosing hardware under power, space, or cost constraints. The launch also points to a public benchmark page and a separate methodology writeup, with initial model coverage limited to gpt-oss-120b and DeepSeek V3.2.
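
To show how that framing works in practice, here is a short worked example of the normalization. Every input below (user count, accelerator count, TDP, rental price, rack density) is invented for illustration and is not an AA-AgentPerf result.

```python
# Hypothetical example of the per-accelerator / per-kW / per-dollar / per-rack
# normalization described above. All inputs are made up for illustration.
max_concurrent_users = 1200     # users sustained at the target per-user output speed
num_accelerators = 8            # accelerators in the system under test
tdp_per_accelerator_kw = 0.7    # assumed kW TDP per accelerator
rental_usd_per_hour = 32.0      # assumed rental price for the whole system
accelerators_per_rack = 32      # assumed rack density

users_per_accelerator = max_concurrent_users / num_accelerators
users_per_kw_tdp = max_concurrent_users / (num_accelerators * tdp_per_accelerator_kw)
users_per_dollar_hour = max_concurrent_users / rental_usd_per_hour
users_per_rack = users_per_accelerator * accelerators_per_rack

print(f"users/accelerator: {users_per_accelerator:.0f}")   # 150
print(f"users/kW TDP:      {users_per_kw_tdp:.0f}")        # ~214
print(f"users/$ per hour:  {users_per_dollar_hour:.1f}")   # 37.5
print(f"users/rack:        {users_per_rack:.0f}")          # 4800
```

The same raw user count can rank two systems differently depending on which denominator binds: a power-constrained deployment cares about users per kW TDP, while a colocation-constrained one cares about users per rack.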