Research & Benchmarks — Explore AI Tools & Stories

Fresh stories

New

OpenHands introduces EvoClaw benchmark; continuous-evolution scores top out at 38.03%

OpenHands introduced EvoClaw, a benchmark that reconstructs milestone DAGs from repo history to test continuous software evolution instead of isolated tasks. The first results show agents can clear single tasks yet still collapse under regressions and technical debt over longer runs.

🔬 OpenHands·1d ago

Breaking

LLM Debate Benchmark ranks Sonnet 4.6 first across 1,162 side-swapped debates

LLM Debate Benchmark ran 1,162 side-swapped debates across 21 models and ranked Sonnet 4.6 first, ahead of GPT-5.4 high. It adds a stronger adversarial eval pattern for judge or debate systems, but you should still inspect content-block rates and judge selection when reading the leaderboard.

New

Claude·1d ago·3 min read

New

Vals AI switches SWE-Bench Verified harness to mini-swe-agent; top score slips to 78.8%

Vals AI switched SWE-Bench Verified from SWE-Agent to the bash-only mini-swe-agent harness, aligning results more closely with the official benchmark setup. Top score dipped slightly to 78.8%, but the change reduces harness-specific confounds when comparing models.

🔬 Benchmarks·1d ago

Top stories this week

Breaking

llm-circuit-finder compares duplicated layers and reports BBH logical deduction gains

The toolkit sweeps contiguous layer ranges in GGUF and llama.cpp-style setups to test whether duplicating them can unlock better reasoning without retraining. Treat the jump as a reproducible experiment, not a settled mechanism, because thread responses challenge whether the effect reflects circuits, routing, or training artifacts.

New

Interpretability·2d ago·3 min read
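
To make the sweep described in this story concrete, here is a minimal sketch of how one might enumerate contiguous layer ranges and build the layer order with a range duplicated. It is not the toolkit's code: the names `contiguous_ranges` and `duplicate_range`, the 8-layer model, and the choice to run the duplicated copy immediately after the original range are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple


def contiguous_ranges(n_layers: int, max_len: int) -> Iterator[Tuple[int, int]]:
    """Yield every half-open (start, end) range of 1..max_len consecutive layers."""
    for start in range(n_layers):
        for length in range(1, max_len + 1):
            end = start + length
            if end <= n_layers:
                yield (start, end)


def duplicate_range(n_layers: int, start: int, end: int) -> List[int]:
    """Return the layer execution order with layers [start, end) run twice in a row."""
    order = list(range(n_layers))
    # Assumption: the duplicate copy is placed directly after the original range.
    return order[:end] + order[start:end] + order[end:]


if __name__ == "__main__":
    # Hypothetical 8-layer model, sweeping ranges of one or two layers.
    for start, end in contiguous_ranges(n_layers=8, max_len=2):
        print((start, end), duplicate_range(8, start, end))
```

Each candidate order would then be scored on the benchmark of interest and compared against the unmodified order, which is what keeps the result an experiment rather than a claim about mechanism.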

Cursor Composer 2 ranks #2 on Next.js evals, ahead of Opus and Gemini

Vercel's Next.js evals place Composer 2 second, ahead of Opus and Gemini despite the recent Kimi-base controversy. The result matters because it separates base-model branding from measured task performance on a real framework workflow.

🔬 Cursor·3d ago

New

Autoresearch claims 2718 Elo after 70 experiments on a Rust chess engine

A developer says an autoresearch loop hill-climbed a vibecoded Rust engine to 2718 Elo after running more than 70 experiments under a 500 ms move budget. The real takeaway is the workflow: automated experiment loops can optimize code against a measurable target.

Workflow·🔬 Deep Research·3d ago

New

Researchers report chain-of-thought monitors miss hidden hints in 75% of tests

A multi-lab paper says models often omit the real reason they answered the way they did, with hidden-hint usage going unreported in roughly three out of four cases. Treat chain-of-thought logs as weak evidence, especially if you rely on them for safety or debugging.

🔬 Claude·3d ago

New

Physical Intelligence introduces RL token for 15-minute robot refinement and 3x speedups

Physical Intelligence says its RL token compresses VLA state into a lightweight signal that an on-robot actor-critic can adapt in minutes. This matters for last-millimeter manipulation, where full-size models are often too slow or too coarse to tune online.

🔬 Reinforcement Learning·3d ago

New

OpenHands compares 3 skill tasks and finds some reduce agent pass rates

OpenHands published a skill-eval recipe with bounded tasks, deterministic verifiers, and no-skill baselines, then showed some skills speed agents up while others make them brittle. Teams shipping skill libraries should measure them per task and model before rollout.

Workflow·🔬 OpenHands·4d ago

New

Reason-ModernColBERT scores 87.59 on BrowseComp-Plus

LightOn’s late-interaction retriever paired with GPT-5 reached 87.59 accuracy on BrowseComp-Plus while using fewer search calls than larger baselines. It suggests deep-research quality may now hinge more on retrieval architecture than on swapping in ever larger LLMs.

🔬 Search·5d ago

New

OpenAI launches Parameter Golf with 16 MB models and 8xH100 training limit

OpenAI opened its first Model Craft challenge, asking participants to train the best language model that fits inside a 16 MB artifact and trains in under 10 minutes on eight H100s. Engineers get a concrete optimization target, an automated GitHub leaderboard, and a public benchmark for training-efficiency tricks.

🔬 Benchmarks·6d ago
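
For a rough sense of what a 16 MB artifact allows, the sketch below computes an upper bound on parameter count at several weight precisions. It rests on two assumptions not stated in the story: that "16 MB" means 16 × 10^6 bytes, and that the whole artifact is raw weights with no code, metadata, or compression. The challenge's actual accounting rules may differ, so treat these numbers as ceilings only.

```python
ARTIFACT_BYTES = 16 * 1_000_000  # assumption: "16 MB" read as 16 * 10^6 bytes


def max_params(bits_per_weight: int, artifact_bytes: int = ARTIFACT_BYTES) -> int:
    """Upper bound on weight count if the artifact held nothing but raw weights."""
    return (artifact_bytes * 8) // bits_per_weight


if __name__ == "__main__":
    for bits in (32, 16, 8, 4):
        print(f"{bits:>2}-bit weights: at most {max_params(bits):,} parameters")
```

Under these assumptions the ceiling runs from 4 million parameters at 32 bits per weight to 32 million at 4 bits, which is why precision and compression choices matter as much as architecture at this size.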

New

Weights & Biases updates Models with synced robotics video playback and pinned baselines

W&B shipped robotics-focused evaluation views including synchronized video playback, pinned run baselines, semantic coloring, and side-by-side media comparisons. These tools matter if your model outputs are videos or trajectories and loss curves alone hide failure modes.

🔬 Observability·1w ago

New

Google DeepMind launches Kaggle benchmark contest with $200k to measure AGI capabilities

Google DeepMind and Kaggle opened a global challenge to build cognitive benchmarks across learning, metacognition, attention, executive function, and social cognition. Join if you work on evals and want reusable tasks with human baselines instead of another saturated leaderboard.

🔬 Benchmarks·1w ago
