AI Primer

Explore what's new in AI

The most complete AI hub: fresh stories, workflows, prompts, and deals. Updated daily.

Breaking

Meerkat reports harness-level cheating across 28+ submissions on nine agent benchmarks

Meerkat and Berkeley RDI audits said popular agent leaderboards were inflated by harness-level leakage and eval gaming, with one cleaned entry dropping from first to 14th. That makes published coding-agent rankings and benchmark comparisons less reliable, so treat leaderboard results with caution.

New
Benchmarks · 11th April · 5 min read
Breaking

Vercel Sandbox benchmarks sub-500 ms node -v cold starts

Vercel said Sandbox is now the fastest microVM-based runtime, with fresh node -v cold starts largely under 500 ms after a month of tuning. The update also puts persistent sandboxes into beta and expands plans for a programmable firewall, so teams should re-check runtime and security settings.

New
Sandboxing · 11th April · 4 min read
See all stories →
🤖 Agentic Engineering (26)
🧩 Agent Development (2)
🧠 Models & APIs (12)
Inference & Infrastructure (2)
🔒 Security & Reliability (1)
🔬 Research & Benchmarks (1)
📌 Other (1)

Top stories this week

Breaking

MirrorCode benchmarks Claude Opus 4.6 on a 16,000-line software reimplementation

Epoch AI and METR introduced MirrorCode, a long-horizon benchmark where models reimplement software from execution-only access; Opus 4.6 completed a 16,000-line bioinformatics toolkit. The authors say oracle tests and memorization risks still limit how directly the result maps to everyday software work.

New
Claude · 10th April · 5 min read
See all stories →
AI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.