Artificial Analysis introduced AA-AgentPerf to benchmark hardware on real coding-agent traces instead of synthetic chat prompts. The benchmark reports maximum concurrent users per accelerator, per kW, per rental dollar, and per rack, so teams can compare production cost and throughput more realistically.

AA-AgentPerf benchmarks inference hardware against agent-style coding sessions instead of single-turn chat loads. According to the launch post, the test set uses real trajectories with long-running interactions, large contexts, and production serving optimizations enabled, which makes the benchmark closer to how coding agents are actually deployed than classic prompt-throughput charts.
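To give a feel for what replaying such a trace involves, here is a minimal sketch of a trajectory record. The `Turn` and `Trajectory` types and their fields are hypothetical; Artificial Analysis has not published the trace schema, only the workload shape (many turns, very long contexts):

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One agent step: the context fed in and the tokens generated."""
    prompt_tokens: int   # context length presented to the model this turn
    output_tokens: int   # tokens the model generated this turn

@dataclass
class Trajectory:
    """A captured coding-agent session, replayed against a serving stack."""
    session_id: str
    turns: list[Turn] = field(default_factory=list)

    @property
    def max_context(self) -> int:
        # Longest context seen in the session; the launch post cites >100K tokens.
        return max((t.prompt_tokens for t in self.turns), default=0)

# A toy session with a growing context, loosely matching the
# "up to 200 turns, sequence lengths >100K tokens" description.
session = Trajectory(
    "demo",
    [Turn(prompt_tokens=1_000 * i, output_tokens=500) for i in range(1, 201)],
)
print(len(session.turns), session.max_context)  # 200 200000
```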
The key metric is not just raw tokens per second. Artificial Analysis's follow-up post says results will be expressed as the maximum number of concurrent users a system can sustain at a given per-user output speed, then broken out as users per accelerator, per kW of TDP, per rental dollar, and per rack. That framing makes the benchmark directly useful for teams choosing hardware under power, space, or cost constraints. The launch also includes a public benchmark page and a separate methodology writeup; initial model coverage is limited to gpt-oss-120b and DeepSeek V3.2.
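To make that framing concrete, here is a short arithmetic sketch of how the breakouts relate to one measured number. Every input value below is an illustrative assumption, not a published AA-AgentPerf result:

```python
# Illustrative only: converting a measured "max concurrent users at a target
# per-user output speed" into the per-accelerator / per-kW / per-dollar /
# per-rack breakouts the benchmark reports. All inputs are made up.

max_concurrent_users = 640        # measured at, say, 30 output tokens/s per user
accelerators_per_node = 8
node_tdp_kw = 10.2                # total accelerator TDP for the node, in kW
node_rental_usd_per_hour = 98.32
nodes_per_rack = 4

users_per_accelerator = max_concurrent_users / accelerators_per_node
users_per_kw_tdp = max_concurrent_users / node_tdp_kw
users_per_rental_dollar = max_concurrent_users / node_rental_usd_per_hour
users_per_rack = max_concurrent_users * nodes_per_rack

print(f"users/accelerator: {users_per_accelerator:.1f}")
print(f"users/kW TDP:      {users_per_kw_tdp:.1f}")
print(f"users/$ per hour:  {users_per_rental_dollar:.2f}")
print(f"users/rack:        {users_per_rack:.0f}")
```

The point of the normalization is that the same measured capacity ranks hardware differently depending on whether power, budget, or rack space is the binding constraint.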
From the launch post: "Introducing AA-AgentPerf - the hardware benchmark for the agent era. Key details: ➤ Real agent workloads, not synthetic queries: we’ve captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens ➤ Production […]"
The follow-up post adds: "We expect initial results to be available within the next 1-2 weeks, after submissions from hardware providers and QA from our team. Results will be visible at artificialanalysis.ai/benchmarks/har… Get to know the evaluation methodology more closely at artificialanalysis.ai/methodology/ag…"