Vercel's Next.js evals place Composer 2 second, ahead of Opus and Gemini despite the recent Kimi-base controversy. The result matters because it separates base-model branding from measured task performance on a real framework workflow.

Vercel's Next.js evals page frames this as agent performance on Next.js code generation and migration tasks, not a broad consumer-model leaderboard. In that setting, Composer 2 takes second place and, per the page summary, lands at a 76% success rate while beating both Opus and Gemini on the benchmark.
That matters because the benchmark is tied to a real framework workflow engineers already care about: shipping and updating Next.js apps. A repost quickly carried the result beyond Vercel's original post, helping turn a product release into a public comparison point for coding agents.
The main pushback came from posts arguing that Cursor made the wrong foundation-model choice. In one widely shared example, the critique says Composer 2 was built on Kimi K2.5 and highlights a screenshot where Kimi sits at #14 on the LMArena Code leaderboard, behind Claude, GPT-5.4, Gemini 3.1 Pro, GLM-5, and MiniMax.
But Vercel's result is a reminder that base-model rank and agent rank are not the same thing. A coding agent is a full system: prompting, planning, tool use, edit strategy, and product UX all affect outcome. Cursor itself has been leaning into that system view; in its Glass teaser, the company describes the experience as "still early" but "clearer now," pointing to a more controlled desktop interface for working with agents.
The gap between those two signals is the real story here. Composer 2 can be built on a debated base model and still score near the top on a framework-specific eval if the surrounding agent stack is good enough.
The early workflow evidence is less about replacing frontier models outright and more about specialization. One practitioner's usage note is blunt: "gpt 5.4 xhigh to plan," then "cursor composer 2 to implement," then back to GPT-5.4 to "audit + fix" before shipping a pull request.
That pattern matches the benchmark story. Composer 2 is showing up as an implementation engine inside a multi-model loop, not necessarily as the only model in the stack. The missing piece, according to an API request, is programmability: users already want Composer 2 exposed through something like OpenRouter so they can plug it into their own agents rather than keep it inside Cursor's product surface.
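If Composer 2 ever does land behind an OpenRouter-style endpoint, that plan/implement/audit loop becomes a few dozen lines of glue. Here is a minimal sketch, assuming an OpenAI-compatible chat-completions endpoint in OpenRouter's request format; both model IDs are hypothetical, since Composer 2 is not actually exposed through OpenRouter today, which is exactly the gap users are flagging:

```ts
// Sketch of the plan -> implement -> audit loop described above.
// Model IDs below are hypothetical, not confirmed OpenRouter slugs.
const API_URL = "https://openrouter.ai/api/v1/chat/completions";

async function chat(model: string, prompt: string): Promise<string> {
  const res = await fetch(API_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  if (!res.ok) throw new Error(`chat failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}

const task = "Add rate limiting to the /api/upload route.";

// 1. Frontier model plans the change.
const plan = await chat("openai/gpt-5.4-xhigh", `Write an implementation plan for: ${task}`);

// 2. Specialized implementation model turns the plan into a diff.
const diff = await chat("cursor/composer-2", `Implement this plan as a unified diff:\n${plan}`);

// 3. Frontier model audits and fixes before the pull request goes up.
const fixed = await chat("openai/gpt-5.4-xhigh", `Review this diff for bugs and return a corrected diff:\n${diff}`);
console.log(fixed);
```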
Vercel Emulate added a programmatic API for creating, resetting, and closing local GitHub, Vercel, and Google emulators inside automated tests. That makes deterministic integration tests easier to wire into CI and agent loops without manual setup.
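The item doesn't show Emulate's actual API surface, so the following is only a sketch of the lifecycle it describes: create an emulator once, reset it between tests for determinism, and close it when the suite ends. The emulator here is a local stub, and every helper name is an assumption rather than Emulate's real interface:

```ts
// Minimal sketch of the create/reset/close lifecycle, with a stub server
// standing in for a real emulator. All names here are assumptions.
import http from "node:http";
import { test, beforeEach, afterAll, expect } from "vitest";

function createEmulator() {
  let state: string[] = []; // in-memory stand-in for emulator state
  const server = http.createServer((req, res) => {
    if (req.method === "POST") state.push(req.url ?? "/");
    res.writeHead(201, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ objects: state.length }));
  });
  server.listen(0); // random free port
  return {
    url: () => {
      const addr = server.address() as { port: number };
      return `http://127.0.0.1:${addr.port}`;
    },
    reset: () => { state = []; }, // deterministic state for every test
    close: () => new Promise<void>((done) => server.close(() => done())),
  };
}

const emulator = createEmulator();
beforeEach(() => emulator.reset()); // wire reset into the test runner
afterAll(() => emulator.close());   // tear down once the suite ends

test("each test starts from a clean emulator", async () => {
  const res = await fetch(`${emulator.url()}/repos/acme/app/hooks`, { method: "POST" });
  expect(res.status).toBe(201);
  expect(await res.json()).toEqual({ objects: 1 });
});
```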
Release: OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
Release: Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
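Cursor hasn't published the implementation, but the core n-gram trick is well known from tools like Google Code Search and Zoekt: intersect posting lists for the query's trigrams to shrink the candidate set, then run the real scan only over survivors. A minimal sketch of that candidate-filtering step (per-file Bloom filters and full regex-to-trigram compilation omitted):

```ts
// Trigram inverted index: trigram -> set of file ids. Illustrative only;
// this is the general technique, not Cursor's actual code.
type FileId = number;

function trigrams(text: string): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= text.length; i++) grams.add(text.slice(i, i + 3));
  return grams;
}

class TrigramIndex {
  private postings = new Map<string, Set<FileId>>();
  private files: string[] = [];

  add(content: string): FileId {
    const id = this.files.push(content) - 1;
    for (const gram of trigrams(content)) {
      let ids = this.postings.get(gram);
      if (!ids) this.postings.set(gram, (ids = new Set()));
      ids.add(id);
    }
    return id;
  }

  // Every trigram of a literal query must occur in any matching file, so
  // intersecting posting lists prunes candidates before the real scan.
  search(literal: string): string[] {
    let candidates: Set<FileId> | null = null;
    for (const gram of trigrams(literal)) {
      const ids = this.postings.get(gram) ?? new Set<FileId>();
      candidates = candidates === null
        ? ids
        : new Set([...candidates].filter((id) => ids.has(id)));
    }
    const pool = candidates ?? new Set(this.files.keys()); // query < 3 chars
    return [...pool]
      .map((id) => this.files[id])
      .filter((content) => content.includes(literal)); // verify, don't trust
  }
}

const index = new TrigramIndex();
index.add("export function useSearchParams() {}");
index.add("const unrelated = 42;");
console.log(index.search("useSearchParams")); // only the first file survives
```

The speedup comes from the intersection step: a literal that appears in a handful of files touches only a handful of posting lists, so the expensive scan runs over candidates instead of the whole repo.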
Breaking: ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
Breaking: Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
Cursor's Composer 2 just took second place on the Next.js evals leaderboard, beating both Opus and Gemini. See the full rankings ↓ vercel.fyi/next-composer2
Cursor built Composer 2 on top of Kimi K2.5. Kimi K2.5 ranks #14 on LMArena Code with 1431 Elo. Behind Claude Opus 4.6. Behind Claude Sonnet 4.6. Behind GPT 5.4. Behind Gemini 3.1 Pro. Behind GLM-5. Behind MiniMax M2.7. You're telling me Cursor picked the #14 ranked …
new workflow for the weekend: - gpt 5.4 xhigh to plan - cursor composer 2 to implement - back to 5.4 xhigh to audit + fix - ship pull request - repeat