Breaking · March 12, 2026

Cursor publishes CursorBench to compare coding models on intelligence and token efficiency

Cursor published its internal benchmarking approach and reported wider separation between coding models than SWE-bench-style leaderboards show. Use it as a reference for production routing decisions, but validate results against your own online traffic and task mix.

Cursor, Coding Agents, Evals, Benchmarks

TL;DR

Cursor published a new scoring method for agentic coding tasks and said it combines offline benchmarks with online evals so results stay useful as public coding benchmarks get saturated. The blog post frames it as a benchmark built from real Cursor sessions rather than a pure leaderboard exercise.
In Cursor's comparison chart, the gaps between models are wider on CursorBench than on SWE-bench Verified: several models cluster around 75-81 on SWE-bench Verified but spread from roughly 29 to 58 on CursorBench, which suggests CursorBench discriminates better on multi-step coding work.
Cursor's efficiency plot charts score against token usage: GPT-5.4 appears on the efficiency frontier at around 16k tokens, while higher-scoring settings such as GPT-5.3 Codex xhigh use substantially more tokens.
Cursor says in