Release: March 25, 2026

ARC Prize launches ARC-AGI-3: Gemini 3.1 Pro scores 0.37%

ARC-AGI-3 swaps static puzzles for interactive, game-like environments and posts initial frontier scores below 1%, with Gemini 3.1 Pro at 0.37%. Teams can use it to inspect agent reasoning, but interpreting the scores still depends heavily on the human-efficiency metric and the no-harness setup.

GPT Gemini Evals Benchmarks

TL;DR

ARC Prize launched ARC-AGI-3 as an interactive successor to the earlier static benchmarks, with 135 novel environments built to test whether agents can “explore the environment,” “form hypotheses,” and “learn and adapt” without natural-language instructions (launch thread, benchmark page).
Initial verified scores are extremely low: ARC Prize says humans reach 100% while current AI is below 1% (launch thread), and an early leaderboard snapshot puts Gemini 3.1 Pro at 0.3%-0.37%, GPT-5.4 around 0.26%-0.3%, Opus 4.6 around 0.2%-0.25%, and Grok 4.20 at 0.0% in one posted result (score snapshot, leaderboard recap).
The benchmark is aimed at agent loops rather than static answer quality: according to the