AI Primer

Arena launches Agent Mode rankings with GPT-5.5 High leading

Arena shipped Agent Mode, a benchmark that lets models use web search, bash, file writing, image generation, and follow-up questions, then ranks them on five live-session signals. It matters because agent evals move from static task sets to real user workflows, with GPT-5.5 High currently leading the leaderboard.


TL;DR

  • Arena's launch post introduced Agent Mode as a tool-using workflow on top of Arena, with web search, sandboxed bash, file writing, image generation, and follow-up questions for multi-step tasks.
  • According to Arena's Agent Arena thread, the companion leaderboard ranks models on five live-session signals: task success, steerability, error recovery, praise versus complaint, and tool hallucination.
  • Arena's leaderboard snapshot put GPT-5.5 High at No. 1, Claude Opus 4.7 Thinking at No. 2, GLM-5.1 at No. 3, Gemini 3.1 Pro at No. 4, and Kimi K2.6 at No. 5.
  • In the Agent Mode blog post, Arena said coding tasks make up 29% of usage, while research and planning each account for 11%.
  • The methodology post says the first leaderboard is built from more than 300,000 tasks, more than 2 million tool calls, and 40 million lines of code written by agents, using a causal-tracing framework instead of Arena's usual pairwise voting.

You can try Agent Mode, browse the agent leaderboard, and read Arena's two launch posts, one on the product and one on the ranking method. The oddest useful detail lives in the methodology post: Arena says 32% of recent sessions ended with at least 128K tokens in the final turn, 8% topped 1 million, and 17% of sessions made 26 or more tool calls.

Agent Mode

Agent Mode is Arena's attempt to move its comparison site from chat outputs to end-to-end work. The launch post says the agent can research, generate images, build websites, debug code, write files, and ask clarifying questions inside one session, instead of forcing users to re-prompt across separate tools.

The built-in tool set matches the product framing in the launch blog:

  • web search
  • sandboxed bash
  • file upload and file writing
  • image generation
  • coding and technical assistance
  • follow-up questioning

Arena positioned the mode as a shared testing surface for frontier labs. Arena's model list named GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and open models at launch, and Arena's Nemotron update added Nemotron 3 Ultra later the same day.

Agent Arena leaderboard

The ranking layer is the more interesting ship. Instead of Arena's familiar head-to-head votes, the methodology post says Agent Arena treats each agent session as a multi-component experiment and estimates each model's "net improvement" with randomized component selection.
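The article does not spell out how Arena's causal-tracing framework computes "net improvement", so the following is only a minimal sketch of the general idea under stated assumptions: if the model serving each session is picked at random, a plain difference in mean outcomes between sessions that got a given model and sessions that did not is an unbiased estimate of that model's effect. The function names, the session record layout, and the simulated success rates are all hypothetical and not taken from Arena.

```python
import random
from statistics import mean


def net_improvement(sessions, model):
    """Difference in mean success: sessions served by `model` minus all others.

    Hypothetical estimator for illustration; not Arena's published method.
    """
    treated = [s["success"] for s in sessions if s["model"] == model]
    control = [s["success"] for s in sessions if s["model"] != model]
    if not treated or not control:
        raise ValueError("need sessions both with and without the model")
    return mean(treated) - mean(control)


def simulate(true_rates, n, seed=0):
    """Generate n synthetic sessions with the model chosen uniformly at random."""
    rng = random.Random(seed)
    models = list(true_rates)
    sessions = []
    for _ in range(n):
        model = rng.choice(models)  # randomized component selection
        success = 1 if rng.random() < true_rates[model] else 0
        sessions.append({"model": model, "success": success})
    return sessions


# Synthetic data only: made-up success rates for two made-up models.
demo = simulate({"model_a": 0.7, "model_b": 0.5}, n=20_000)
print(f"estimated net improvement of model_a: {net_improvement(demo, 'model_a'):.3f}")
```

Because assignment is random, user and task differences average out across the two groups, which is what lets a simple comparison like this stand in for a causal estimate. A production system would presumably also adjust for the other randomized components in each session.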

Arena's first five signals are listed directly in Arena's overview thread and expanded in the methodology post:

  1. Confirmed success
  2. Praise versus complaint
  3. Steerability
  4. Bash recovery
  5. Tool hallucination

Arena's top-five ranking showed the initial order at the top of the board:

  1. GPT-5.5 High
  2. Claude Opus 4.7 Thinking
  3. GLM-5.1
  4. Gemini 3.1 Pro
  5. Kimi K2.6

The leaderboard page also exposes the aggregate "net improvement" numbers behind that ordering. In the current snapshot, GPT-5.5 High is at 10.66% plus or minus 1.60%, while Claude Opus 4.7 Thinking is at 9.47% plus or minus 1.50%.
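As a quick worked check on those two figures, the sketch below treats each "plus or minus" value as the half-width of a symmetric interval. That reading is an assumption, since the article does not say what confidence level the margins represent, and the helper names are illustrative.

```python
def interval(center, half_width):
    """Return (low, high) for a symmetric interval around center."""
    return (center - half_width, center + half_width)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


gpt_55_high = interval(10.66, 1.60)   # roughly 9.06 to 12.26
opus_47 = interval(9.47, 1.50)        # roughly 7.97 to 10.97

print(overlaps(gpt_55_high, opus_47))  # prints True
```

The two ranges overlap, and the 1.19-point gap between the top two models is smaller than either stated margin, so the No. 1 and No. 2 ordering in this snapshot should be read as close rather than settled.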

Usage traces

Arena is betting that the raw session stream is the benchmark. petergostev's post said users do not know which model completed a task, while the FAQ section of the launch post says the leaderboard is built from live behavioral signals such as turn-by-turn feedback, explicit approve or disapprove labels, artifact downloads, and other user behavior from hundreds of thousands of sessions.

The product blog and methodology post add a few concrete usage slices:

  • Coding is 29% of the task mix, per the launch post.
  • A 7-day slice in the methodology post counted 160,480 tasks across 128,244 sessions.
  • 75.6% of sessions used at least one tool, 41.1% ran bash, and 27.1% ran web search, per the methodology post.
  • Agent Mode logged about 936,000 bash calls, about 550,000 file writes, and about 275,000 web searches in that same window, per the methodology post.
  • Arena says users more often tighten control than loosen it on follow-up turns, treating the model more like an employee than a fully autonomous system, in the launch post.

That last point is the most grounded counterweight to the usual agent-demo hype. Even on a product built for delegation, Arena says the dominant pattern is supervised delegation, not users handing over the keyboard.
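The usage bullets above also support some rough per-session arithmetic. The sketch below assumes the percentages and the call counts all describe the same 7-day slice of 128,244 sessions, which the article implies but does not state outright, and the call counts are themselves approximate, so the derived figures are ballpark values only.

```python
# Figures quoted in the article (methodology post, 7-day slice).
TASKS = 160_480
SESSIONS = 128_244
BASH_SHARE = 0.411        # share of sessions that ran bash
SEARCH_SHARE = 0.271      # share of sessions that ran web search
BASH_CALLS = 936_000      # "about" in the source
SEARCH_CALLS = 275_000    # "about" in the source


def calls_per_active_session(total_calls, sessions, share):
    """Average calls among only the sessions that used the tool at all."""
    active_sessions = sessions * share
    return total_calls / active_sessions


tasks_per_session = TASKS / SESSIONS
bash_per_session = calls_per_active_session(BASH_CALLS, SESSIONS, BASH_SHARE)
search_per_session = calls_per_active_session(SEARCH_CALLS, SESSIONS, SEARCH_SHARE)

print(f"tasks per session: {tasks_per_session:.2f}")
print(f"bash calls per bash-using session: {bash_per_session:.1f}")
print(f"web searches per search-using session: {search_per_session:.1f}")
```

Under those assumptions, a session holds about 1.25 tasks on average, and sessions that touch bash average on the order of 18 bash calls each, against roughly 8 searches for sessions that use web search.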

Session scale

The methodology post reads like a quiet benchmark paper hiding inside a launch thread. It says Agent Mode wrote 40.3 million lines of code in the last week, averaged about 16.5 structured tool calls per session, and saw more than 3,400 loop-filtered high-tool sessions in a single week.

Two numbers stand out because they say more about harness stress than model branding. In the methodology post, Arena says 17% of sessions made 26 or more tool calls, and 8% of sessions exceeded 1 million input tokens on the final turn.

Arena also used the launch to anchor this new benchmark against its older one. Arena's community post said Battle Mode just passed 50 million votes, which helps explain why the company is now trying to turn its comparison traffic into a live agent-eval dataset instead of another static leaderboard.
