Vercel's Next.js evals place Composer 2 second, ahead of Opus and Gemini despite the recent Kimi-base controversy. The result matters because it separates base-model branding from measured task performance on a real framework workflow.

Vercel's Next.js evals page frames this as agent performance on Next.js code generation and migration tasks, not a broad consumer-model leaderboard. In that setting, Composer 2 takes second place and, per the page summary, lands at a 76% success rate while beating both Opus and Gemini on the Vercel benchmark.
That matters because the benchmark is tied to a real framework workflow engineers already care about: shipping and updating Next.js apps. The result was also quickly amplified beyond Vercel's original post through reposts, which helped turn a product release into a public comparison point for coding agents.
The main pushback came from posts arguing that Cursor made the wrong foundation-model choice. In one widely shared example, the critique says Composer 2 was built on Kimi K2.5 and highlights a screenshot where Kimi sits at #14 on a code arena leaderboard, behind Claude, GPT-5.4, Gemini 3.1 Pro, GLM-5, and MiniMax.
But Vercel's result is a reminder that base-model rank and agent rank are not the same thing. A coding agent is a full system: prompting, planning, tool use, edit strategy, and product UX all affect outcome. Cursor itself has been leaning into that system view; in its Glass teaser, the company describes the experience as "still early" but "clearer now," pointing to a more controlled desktop interface for working with agents.
The gap between those two signals is the real story here. Composer 2 can be built on a debated base model and still score near the top on a framework-specific eval if the surrounding agent stack is good enough.
The early workflow evidence is less about replacing frontier models outright and more about specialization. One practitioner's usage note is blunt: "gpt 5.4 xhigh to plan," then "cursor composer 2 to implement," then back to GPT-5.4 to "audit + fix" before shipping a pull request.
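That plan-implement-audit loop can be sketched as a simple dispatcher. Everything here is illustrative: the model names mirror the quoted workflow, and `call_model` is a stub standing in for whatever provider API each model would actually be called through.

```python
# Sketch of the multi-model loop from the quoted workflow.
# Model names are taken from the post; `call_model` is a stub, not a real API.

PLANNER = "gpt-5.4-xhigh"   # used to plan, then to audit + fix
IMPLEMENTER = "composer-2"  # used to implement the plan

def call_model(model: str, prompt: str) -> str:
    # Stub: a real implementation would send `prompt` to `model`'s API
    # and return the completion text.
    return f"[{model}] {prompt[:40]}"

def run_iteration(task: str) -> list[tuple[str, str]]:
    """One pass of the loop; returns (stage, model) pairs in order."""
    stages = []
    plan = call_model(PLANNER, f"Plan the change: {task}")
    stages.append(("plan", PLANNER))
    diff = call_model(IMPLEMENTER, f"Implement this plan: {plan}")
    stages.append(("implement", IMPLEMENTER))
    call_model(PLANNER, f"Audit and fix this diff: {diff}")
    stages.append(("audit", PLANNER))
    return stages
```

The point of the structure is that the implementation model is swappable: Composer 2 slots into the middle stage without the planner or auditor changing.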
That pattern matches the benchmark story. Composer 2 is showing up as an implementation engine inside a multi-model loop, not necessarily as the only model in the stack. The missing piece, judging by user requests for API access, is programmability: users already want Composer 2 exposed through something like OpenRouter so they can plug it into their own agents rather than keep it inside Cursor's product surface.
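What users are asking for looks roughly like this. Note the heavy assumptions: Composer 2 is not actually available on OpenRouter, and the `cursor/composer-2` slug is invented for illustration; only the OpenRouter endpoint shape (an OpenAI-compatible `/chat/completions` API) is real.

```python
# Hypothetical: "cursor/composer-2" is an invented model slug; Composer 2
# is not exposed on OpenRouter today. The endpoint format is OpenRouter's
# real OpenAI-compatible chat completions API.

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, user_prompt: str) -> dict:
    """Build the JSON body an agent would POST to the endpoint above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a code-editing agent."},
            {"role": "user", "content": user_prompt},
        ],
    }

body = build_request("cursor/composer-2", "Migrate this page to the App Router.")
```

If Cursor shipped an endpoint like this, the model would drop into any existing OpenAI-compatible agent harness with a one-line config change.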
Cursor's Composer 2 just took second place on the Next.js evals leaderboard, beating both Opus and Gemini. See the full rankings ↓ vercel.fyi/next-composer2
Cursor built Composer 2 on top of Kimi K2.5. Kimi K2.5 ranks #14 on LMArena Code with 1431 Elo. Behind Claude Opus 4.6. Behind Claude Sonnet 4.6. Behind GPT 5.4. Behind Gemini 3.1 Pro. Behind GLM-5. Behind MiniMax M2.7. You're telling me Cursor picked the #14 ranked
new workflow for the weekend:
- gpt 5.4 xhigh to plan
- cursor composer 2 to implement
- back to 5.4 xhigh to audit + fix
- ship pull request
- repeat