Artificial Analysis launches Coding Agent Index: Cursor plus Opus 4.7 scores 61, Codex plus GPT-5.5 60
Artificial Analysis launched a Coding Agent Index for model-and-harness pairs, while OpenHands refreshed its model leaderboard. The results show harness choice matters, with cost varying over 30x and task time over 7x across stacks.

TL;DR
- ArtificialAnlys' launch thread introduced a coding-agent benchmark that scores model-and-harness pairs together, not models in isolation, across SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA.
- According to ArtificialAnlys' results, Cursor CLI plus Opus 4.7 led at 61, while Codex plus GPT-5.5 and Claude Code plus Opus 4.7 tied at 60.
- ArtificialAnlys' cost breakdown found a bigger spread in operating characteristics than in top-line scores: cost per task varied by more than 30x, token usage by more than 3x, and time per task by more than 7x.
- In OpenHandsDev's trend snapshot, OpenHands also put Opus 4.7 at the top of its broader software-engineering index, with GPT-5.5 close behind and cheaper models like Gemini 3.1 Pro and DeepSeek-3.2-Thinker sitting further down the cost curve.
- OpenRouter's Pareto Code launch shows the immediate downstream use case for benchmark feeds like this one: routing requests to the cheapest coding model that clears a target score.
You can browse the Artificial Analysis benchmark page, compare it with the OpenHands Index, and see how quickly these rankings turn into product logic in OpenRouter's Pareto Code router. The fun detail is that the headline gap is tiny, but the systems underneath it are not: teortaxesTex's repost of the cost charts highlights how wildly token usage and cache behavior diverge across stacks, while OpenHandsDev's breakdown says Claude still looks strongest on long-horizon builds and software testing.
Coding Agent Index
Artificial Analysis is benchmarking the full stack, model plus harness, instead of pretending the shell around the model is incidental. Its composite score averages pass@1 across three tasks: SWE-Bench-Pro-Hard-AA for issue resolution, Terminal-Bench v2 for terminal work, and SWE-Atlas-QnA for codebase investigation and text answers.
That framing is the useful part. Engineers do not buy "GPT-5.5" or "Opus 4.7" in the abstract; they buy Codex, Cursor CLI, Claude Code, or some internal harness that shapes prompts, tool calls, retries, and caching.
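To make the arithmetic concrete, here is a minimal sketch of that composite as described, an unweighted average of pass@1 across the three component benchmarks. The per-benchmark numbers below are hypothetical placeholders, not published figures.

```python
# Minimal sketch of the Coding Agent Index composite: an unweighted
# average of pass@1 across the three component benchmarks.

def composite_score(pass_at_1: dict[str, float]) -> float:
    """Average pass@1 across the three component benchmarks."""
    benchmarks = ["SWE-Bench-Pro-Hard-AA", "Terminal-Bench v2", "SWE-Atlas-QnA"]
    return sum(pass_at_1[b] for b in benchmarks) / len(benchmarks)

# Hypothetical stack; numbers chosen only to show the arithmetic.
example = {
    "SWE-Bench-Pro-Hard-AA": 58.0,
    "Terminal-Bench v2": 63.0,
    "SWE-Atlas-QnA": 62.0,
}
print(composite_score(example))  # 61.0
```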
Leaderboard
The leaderboard is tightly packed at the top:
- Cursor CLI + Opus 4.7: 61
- Codex + GPT-5.5: 60
- Claude Code + Opus 4.7: 60
- Cursor CLI + GPT-5.5: 58
- Claude Code + GLM-5.1: 53
- Claude Code + Kimi K2.6: 50
- Claude Code + DeepSeek V4 Pro: 50
- Claude Code + Sonnet 4.6: 49
- Cursor CLI + Composer 2: 48
- Gemini CLI + Gemini 3.1 Pro: 43
ArtificialAnlys' benchmark-by-benchmark note adds that the tie at the top hides different strengths. GPT-5.5 in Codex led on SWE-Atlas-QnA and Terminal-Bench v2, while Opus 4.7 in Claude Code led on SWE-Bench-Pro-Hard-AA.
Cost and latency
The score spread from first to last is 18 points. The operating spread is much larger:
- Cost per task: $0.07 for Cursor CLI + Composer 2, up to $2.26 for Claude Code + GLM-5.1, per ArtificialAnlys' launch thread
- Token usage: 1.5M for Cursor CLI + Opus 4.7, up to 4.8M for Claude Code + GLM-5.1, according to the reposted chart
- Cache hit rate: 80% to 96%, per ArtificialAnlys' launch thread
- Time per task: about 5.8 minutes for Claude Code + Opus 4.7, up to 41.5 minutes for Claude Code + Kimi K2.6, per ArtificialAnlys' analysis
Artificial Analysis said GLM-5.1's high token burn was partly driven by looping on some tasks, while Kimi K2.6 posted the highest turn count and the slowest average completion time. That is the part benchmark leaderboards usually flatten away.
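The headline multiples follow directly from the endpoints above; a quick sanity check on the reported figures:

```python
# Spread ratios from the figures reported in the launch thread.
# Each pair is (best stack's value, worst stack's value).
reported = {
    "cost_per_task_usd": (0.07, 2.26),  # Cursor CLI + Composer 2 vs Claude Code + GLM-5.1
    "tokens_per_task_m": (1.5, 4.8),    # Cursor CLI + Opus 4.7 vs Claude Code + GLM-5.1
    "minutes_per_task": (5.8, 41.5),    # Claude Code + Opus 4.7 vs Claude Code + Kimi K2.6
}

for metric, (low, high) in reported.items():
    print(f"{metric}: {high / low:.1f}x")
# cost_per_task_usd: 32.3x
# tokens_per_task_m: 3.2x
# minutes_per_task: 7.2x
```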
OpenHands cross-check
OpenHands is measuring a different surface, five software-engineering tasks across more than 20 models, but its refresh points in roughly the same direction. OpenHandsDev's longer thread said Opus 4.7 took the top overall spot, GPT-5.5 stayed close, Gemini 3.1 Pro looked cheaper, and DeepSeek-3.2-Thinker came in at about one-tenth Claude's price.
The task breakdown is more revealing than the ranking. According to OpenHandsDev's task-by-task summary:
- SWE-Bench issue resolution was crowded at the top.
- Greenfield, long-horizon development favored Claude and GPT.
- Frontend work favored Claude and Gemini, with Opus 4.7 standing out.
- GLM-5.1 was unusually competitive on frontend tasks for an open model.
- Software testing was Claude's strongest category.
- Information gathering favored GPT.
OpenHandsDev's speed note also credits Claude and GPT with faster issue resolution because of effective parallel tool calling and efficient inference. That lines up with the new Artificial Analysis index treating harness behavior as first-order benchmark data, not implementation detail.
Routing products
OpenRouter already turned Artificial Analysis rankings into a router. Pareto Code lets users set min_coding_score and sends each request to the cheapest coding model above that threshold; OpenRouter's lineup post says the launch started with 13 models and up to 2M context.
OpenRouter's Nitro post adds a second mode that re-ranks models within a tier by throughput instead of price. That is a small but telling coda to the benchmark week: once rankings start tracking cost, latency, and score together, they stop looking like marketing charts and start looking like routing tables.
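As an illustration of what that routing table looks like, here is a minimal sketch of the selection logic as described, not OpenRouter's actual API or implementation. The catalog entries and numbers are made up; min_coding_score is the only parameter name taken from the source, and the throughput mode stands in for the Nitro-style re-rank.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    coding_score: float    # benchmark-derived coding score
    price_per_mtok: float  # blended $ per million tokens
    throughput_tps: float  # observed tokens per second

# Illustrative catalog; entries and numbers are invented, not OpenRouter data.
CATALOG = [
    Model("model-a", 61, 15.00, 40),
    Model("model-b", 60, 10.00, 90),
    Model("model-c", 53, 0.60, 70),
    Model("model-d", 48, 0.25, 120),
]

def route(min_coding_score: float, mode: str = "price") -> Model:
    """Pick among models clearing the score floor: cheapest by default,
    or fastest when mode='throughput' (the Nitro-style re-rank)."""
    eligible = [m for m in CATALOG if m.coding_score >= min_coding_score]
    if not eligible:
        raise ValueError("no model clears the score threshold")
    if mode == "throughput":
        return max(eligible, key=lambda m: m.throughput_tps)
    return min(eligible, key=lambda m: m.price_per_mtok)

print(route(min_coding_score=52).name)                     # model-c: cheapest above 52
print(route(min_coding_score=52, mode="throughput").name)  # model-b: fastest above 52
```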