Cursor Composer 2 ranks #2 on Next.js evals, ahead of Opus and Gemini

Vercel's Next.js evals place Composer 2 second, ahead of Opus and Gemini despite the recent Kimi-base controversy. The result matters because it separates base-model branding from measured task performance on a real framework workflow.

TL;DR

  • Vercel's Next.js eval post says Cursor Composer 2 is now second on its Next.js agent leaderboard, ahead of both Opus and Gemini, which gives Cursor a framework-specific result that is stronger than the current base-model discourse suggests.
  • The underlying eval page describes task-based code generation and migration tests, and its summary puts Composer 2 at a 76% success rate, making this a concrete workflow benchmark rather than a generic chatbot ranking.
  • That outcome cuts against the recent Kimi-base criticism arguing Composer 2 should underperform because it was built on Kimi K2.5, a model the post places at #14 on LMArena Code.
  • Early practitioner usage already looks hybrid: one developer's weekend workflow uses GPT-5.4 xhigh for planning and audit, with Composer 2 doing implementation in the middle.

What did the eval actually show?

Vercel's Next.js evals page frames this as agent performance on Next.js code generation and migration tasks, not a broad consumer-model leaderboard. In that setting, Composer 2 takes second place and, per the page summary, lands at a 76% success rate while beating both Opus and Gemini.

That matters because the benchmark is tied to a real framework workflow engineers already care about: shipping and updating Next.js apps. The result was also quickly amplified beyond Vercel's original post through reposts, which helped turn a product release into a public comparison point for coding agents.

Why the Kimi-base argument didn't settle the question

The main pushback came from posts arguing that Cursor made the wrong foundation-model choice. In one widely shared example, the critique says Composer 2 was built on Kimi K2.5 and highlights a screenshot where Kimi sits at #14 on a code arena leaderboard, behind Claude, GPT-5.4, Gemini 3.1 Pro, GLM-5, and MiniMax.

But Vercel's result is a reminder that base-model rank and agent rank are not the same thing. A coding agent is a full system: prompting, planning, tool use, edit strategy, and product UX all affect outcome. Cursor itself has been leaning into that system view; in its Glass teaser, the company describes the experience as "still early" but "clearer now," pointing to a more controlled desktop interface for working with agents.

The gap between those two signals is the real story here. Composer 2 can be built on a debated base model and still score near the top on a framework-specific eval if the surrounding agent stack is good enough.
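
That distinction is concrete enough to sketch. The following is a minimal, hypothetical TypeScript loop, not Cursor's or Vercel's actual code: `callModel` and `runChecks` are stand-ins the caller supplies, and the point is only that planning, verification, and retries around the same base model change what a task benchmark measures.

```ts
type CheckResult = { passed: boolean; feedback: string };
type ModelCall = (prompt: string) => Promise<string>;
type Checker = (patch: string) => Promise<CheckResult>;

// The loop below is the "agent stack": plan first, verify every edit,
// and feed failures back in rather than returning the model's first answer.
async function agentSolve(
  task: string,
  callModel: ModelCall, // any base model client: Kimi, GPT, Claude, ...
  runChecks: Checker,   // type-check, lint, run the test suite, ...
  maxAttempts = 3
): Promise<string | null> {
  const plan = await callModel(`Write a step-by-step plan for:\n${task}`);

  let feedback = "none yet";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const patch = await callModel(
      `Task: ${task}\nPlan: ${plan}\nPrior failures: ${feedback}\nReturn a patch.`
    );
    const result = await runChecks(patch);
    if (result.passed) return patch; // counts as a success on a task eval
    feedback = result.feedback; // retry with the error in context
  }
  return null; // counts as a failure, whatever the base model's chat rank
}
```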

How engineers are starting to use it

The early workflow evidence is less about replacing frontier models outright and more about specialization. One practitioner's usage note is blunt: "gpt 5.4 xhigh to plan," then "cursor composer 2 to implement," then back to GPT-5.4 to "audit + fix" before shipping a pull request.
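
A minimal sketch of that loop, with a hypothetical `ask` helper standing in for each model's client; the model names mirror the practitioner's note, not any stable API:

```ts
type Ask = (model: string, prompt: string) => Promise<string>;

// Plan with the general model, implement with the coding agent,
// then audit with the general model before opening a pull request.
async function shipChange(task: string, ask: Ask): Promise<string> {
  const plan = await ask("gpt-5.4-xhigh", `Plan this change:\n${task}`);
  const patch = await ask("composer-2", `Implement this plan as a patch:\n${plan}`);
  return ask("gpt-5.4-xhigh", `Audit this patch and fix any issues:\n${patch}`);
}
```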

That pattern matches the benchmark story. Composer 2 is showing up as an implementation engine inside a multi-model loop, not necessarily as the only model in the stack. The missing piece, according to at least one public request for API access, is programmability: users already want Composer 2 exposed through something like OpenRouter so they can plug it into their own agents rather than keep it inside Cursor's product surface.
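
For what that could look like in practice, here is a sketch against OpenRouter's OpenAI-compatible chat completions endpoint, which does exist; the model id `cursor/composer-2` is hypothetical, since Composer 2 is not actually listed there today.

```ts
async function callComposer(prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "cursor/composer-2", // hypothetical id, not listed today
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`OpenRouter error: ${res.status}`);
  const data = (await res.json()) as {
    choices: { message: { content: string } }[];
  };
  return data.choices[0].message.content;
}
```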
