Vercel's Next.js evals place Composer 2 second, ahead of Opus and Gemini despite the recent Kimi-base controversy. The result matters because it separates base-model branding from measured task performance on a real framework workflow.

Vercel's Next.js evals page frames this as agent performance on Next.js code generation and migration tasks, not a broad consumer-model leaderboard. In that setting, Composer 2 takes second place and, per the page summary, lands at a 76% success rate while beating both Opus and Gemini on the Vercel benchmark.
That matters because the benchmark is tied to a real framework workflow engineers already care about: shipping and updating Next.js apps. The result was also quickly amplified beyond Vercel's original post through reposts, which helped turn a product release into a public comparison point for coding agents.
The main pushback came from posts arguing that Cursor made the wrong foundation-model choice. In one widely shared example, the critique says Composer 2 was built on Kimi K2.5 and highlights a screenshot where Kimi sits at #14 on a code arena leaderboard, behind Claude, GPT-5.4, Gemini 3.1 Pro, GLM-5, and MiniMax.
But Vercel's result is a reminder that base-model rank and agent rank are not the same thing. A coding agent is a full system: prompting, planning, tool use, edit strategy, and product UX all affect outcome. Cursor itself has been leaning into that system view; in its Glass teaser, the company describes the experience as "still early" but "clearer now," pointing to a more controlled desktop interface for working with agents.
The gap between those two signals is the real story here. Composer 2 can be built on a debated base model and still score near the top on a framework-specific eval if the surrounding agent stack is good enough.
The early workflow evidence is less about replacing frontier models outright and more about specialization. One practitioner's usage note is blunt: "gpt 5.4 xhigh to plan," then "cursor composer 2 to implement," then back to GPT-5.4 to "audit + fix" before shipping a pull request.
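That plan-implement-audit loop can be sketched as a simple dispatcher. Everything here is illustrative: the model names mirror the quoted workflow, and `call_model` is a stub standing in for whatever provider API each model would actually be called through.

```python
# Sketch of the multi-model loop from the quoted workflow.
# Model names are taken from the post; `call_model` is a stub, not a real API.

PLANNER = "gpt-5.4-xhigh"   # used to plan, then to audit + fix
IMPLEMENTER = "composer-2"  # used to implement the plan

def call_model(model: str, prompt: str) -> str:
    # Stub: a real implementation would send `prompt` to `model`'s API
    # and return the completion text.
    return f"[{model}] {prompt[:40]}"

def run_iteration(task: str) -> list[tuple[str, str]]:
    """One pass of the loop; returns (stage, model) pairs in order."""
    stages = []
    plan = call_model(PLANNER, f"Plan the change: {task}")
    stages.append(("plan", PLANNER))
    diff = call_model(IMPLEMENTER, f"Implement this plan: {plan}")
    stages.append(("implement", IMPLEMENTER))
    call_model(PLANNER, f"Audit and fix this diff: {diff}")
    stages.append(("audit", PLANNER))
    return stages
```

The point of the structure is that the implementation model is swappable: Composer 2 slots into the middle stage without the planner or auditor changing.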
That pattern matches the benchmark story. Composer 2 is showing up as an implementation engine inside a multi-model loop, not necessarily as the only model in the stack. The missing piece, judging by user requests for API access, is programmability: users already want Composer 2 exposed through something like OpenRouter so they can plug it into their own agents rather than keep it inside Cursor's product surface.
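What users are asking for looks roughly like this. Note the heavy assumptions: Composer 2 is not actually available on OpenRouter, and the `cursor/composer-2` slug is invented for illustration; only the OpenRouter endpoint shape (an OpenAI-compatible `/chat/completions` API) is real.

```python
# Hypothetical: "cursor/composer-2" is an invented model slug; Composer 2
# is not exposed on OpenRouter today. The endpoint format is OpenRouter's
# real OpenAI-compatible chat completions API.

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, user_prompt: str) -> dict:
    """Build the JSON body an agent would POST to the endpoint above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a code-editing agent."},
            {"role": "user", "content": user_prompt},
        ],
    }

body = build_request("cursor/composer-2", "Migrate this page to the App Router.")
```

If Cursor shipped an endpoint like this, the model would drop into any existing OpenAI-compatible agent harness with a one-line config change.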
Cursor's Composer 2 just took second place on the Next.js evals leaderboard, beating both Opus and Gemini. See the full rankings ↓ vercel.fyi/next-composer2
Cursor built Composer 2 on top of Kimi K2.5. Kimi K2.5 ranks #14 on LMArena Code with 1431 Elo. Behind Claude Opus 4.6. Behind Claude Sonnet 4.6. Behind GPT 5.4. Behind Gemini 3.1 Pro. Behind GLM-5. Behind MiniMax M2.7. You're telling me Cursor picked the #14 ranked
new workflow for the weekend:
- gpt 5.4 xhigh to plan
- cursor composer 2 to implement
- back to 5.4 xhigh to audit + fix
- ship pull request
- repeat