Gemini 3 Pro beats GPT‑5.2 on CAIS – 57.1 vision index

Sat, Dec 13, 2025

Executive Summary

After last week’s GPT‑5.2 hype cycle, today’s third‑party evals paint a more nuanced frontier picture. On CAIS’s new Text Capabilities Index, Gemini 3 Pro edges out GPT‑5.2 with a 47.6 average vs 45.9, and a big lead in expert‑level reasoning (38.3 vs 29.9) plus Terminal‑Bench coding (53.4 vs 47.7). GPT‑5.2 still owns ARC‑AGI‑2 abstract reasoning at 43.3 vs Gemini’s 31.1, so the “hard math and puzzles → OpenAI” routing rule holds.

CAIS’s Vision Index tilts even harder toward Google: Gemini 3 Pro clocks a 57.1 average vs GPT‑5.2’s 52.6, with standout scores on embodied ERQA (70.2 vs 60.7) and MindCube spatial navigation (77.3 vs 61.7). OCR‑Arena tells a similar story for documents: GPT‑5.2 Medium lands #4 with Elo 1648 and 34.3s latency per page, behind Gemini 3 Preview, Gemini 2.5 Pro, and Opus 4.5 Medium.
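The CAIS figures quoted in this summary can be lined up as per-benchmark gaps. The following is a minimal sketch that uses only the scores cited here; the `SCORES` layout and the `gaps` helper are illustrative assumptions, not CAIS tooling.

```python
# Scores as quoted in this summary: (Gemini 3 Pro, GPT-5.2).
SCORES = {
    "Text index average": (47.6, 45.9),
    "Expert-level reasoning": (38.3, 29.9),
    "Terminal-Bench": (53.4, 47.7),
    "ARC-AGI-2": (31.1, 43.3),
    "Vision index average": (57.1, 52.6),
    "ERQA": (70.2, 60.7),
    "MindCube": (77.3, 61.7),
}


def gaps(scores):
    """Return {benchmark: Gemini 3 Pro minus GPT-5.2}, rounded to one decimal."""
    return {name: round(gemini - gpt, 1) for name, (gemini, gpt) in scores.items()}


if __name__ == "__main__":
    for name, gap in gaps(SCORES).items():
        leader = "Gemini 3 Pro" if gap > 0 else "GPT-5.2"
        print(f"{name:24s} {gap:+6.1f}  ({leader} leads)")
```

On these numbers the only negative gap is ARC‑AGI‑2 (-12.2), while the largest positive gap is MindCube (+15.6).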

WeirdML adds cost texture: GPT‑5.2‑xhigh nudges ahead in accuracy (~0.722 vs Gemini 3 Pro’s ~0.699) but burns around $2.05 per run against $0.526 for Gemini. Meanwhile, builders are openly dunking on SimpleBench after it ranks 5.2 below older models, treating it as noisy trivia rather than a routing oracle. Net: use GPT‑5.2 for peak abstract reasoning, but Gemini 3 Pro looks like the current default for multimodal coding, agents, and doc‑heavy workloads.
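The WeirdML trade-off is easier to judge as a ratio. The sketch below uses only the approximate figures quoted above; the helper names are hypothetical, and "per run" is taken at face value from the summary.

```python
# WeirdML figures as quoted in this summary (both accuracies are approximate).
GPT_ACC, GPT_COST = 0.722, 2.05    # GPT-5.2-xhigh: accuracy, USD per run
GEM_ACC, GEM_COST = 0.699, 0.526   # Gemini 3 Pro: accuracy, USD per run


def cost_ratio(a_cost, b_cost):
    """How many times more expensive run A is than run B."""
    return a_cost / b_cost


def extra_cost_per_point(a_acc, a_cost, b_acc, b_cost):
    """Extra USD per run paid for each additional accuracy point (0.01)."""
    return (a_cost - b_cost) / ((a_acc - b_acc) * 100)


if __name__ == "__main__":
    print(f"cost ratio: {cost_ratio(GPT_COST, GEM_COST):.1f}x")
    print(
        "extra USD per accuracy point: "
        f"{extra_cost_per_point(GPT_ACC, GPT_COST, GEM_ACC, GEM_COST):.2f}"
    )
```

With these inputs GPT‑5.2‑xhigh costs roughly 3.9× as much per run, which works out to about $0.66 extra per run for each additional accuracy point.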

Feature Spotlight

Feature: Gemini gets deeper context and better UX

Gemini tightens its end‑to‑end workflow: you can attach NotebookLM notebooks as context and mark up images to guide edits, Live voice is smoother (a mute toggle and fewer cut‑offs), and Translate gains speech‑to‑speech, all of which make Gemini more useful for real work.

Updates reported across multiple accounts concentrated on Google’s Gemini app: NotebookLM notebooks as chat context, image markup to steer edits, fewer Live voice cut‑offs plus a mute toggle, speech‑to‑speech in Translate, and signs of a fresh 3 Pro checkpoint.

On this page

Executive Summary
Feature Spotlight: Feature: Gemini gets deeper context and better UX
🧠 Feature: Gemini gets deeper context and better UX
Gemini app prepares NotebookLM notebook context as an attachment source
Builders spot an “af97” Gemini 3 Pro checkpoint under A/B test
Gemini app adds freehand image markup to steer visual edits
Gemini Live reduces mid‑sentence cut‑offs and adds a mute toggle
📊 Frontier evals: GPT‑5.2 vs Gemini 3 Pro across suites
CAIS Text Capabilities Index puts Gemini 3 Pro ahead, GPT‑5.2 best on ARC‑AGI‑2
CAIS Vision Index gives Gemini 3 Pro a clear edge over GPT‑5.2
WeirdML shows GPT‑5.2‑xhigh slightly ahead of Gemini 3 Pro but at 4× the cost
Builders start dismissing SimpleBench after GPT‑5.2 and Opus rankings
OCR‑Arena ranks GPT‑5.2 #4 behind Gemini and Opus on document reading
🛠️ Agent stacks and developer tooling updates
LangChain community ships open course for real phone-call agents
Oracle CLI adds Gemini web browsing, images and YouTube tools
PeopleHub turns LinkedIn due‑diligence into an AI agent workflow
Tutorial shows how to run stateful LangGraph agents on AWS Serverless
Clawdis grows into a distributed multi-device agent canvas
CodexBar adds Gemini support and better model usage telemetry
Warp terminal rolls out slash commands for agent and MCP workflows
SonosCLI gives agents and scripts first‑class control over Sonos speakers
🧩 Agent reliability and GraphRAG advances
Meta open-sources Confucius Code Agent with 54.3% on SWE‑Bench‑Pro
ReG (Weak‑to‑Strong GraphRAG) aligns graph retrievers to LLM reasoning
ReMe turns past tool runs into compact procedural memory for agents
VIGIL introduces a self‑healing runtime that patches failing agents in flight
🛡️ Safety, robustness and compliance signals
CNFinBench reveals LLM finance skills far outpacing rule‑following and risk controls
Google’s FACTS Leaderboard shows top models only ~69% factual in practice
Hidden PDF prompts can flip LLM peer reviews from reject to accept
OpenAI accused of soft‑pedaling AI job‑loss research as staffer quits
VEIL jailbreak exposes audio‑style backdoor in text‑to‑video safety filters
Small adversarial tweaks badly mislead LLM‑based vehicle trajectory prediction
🧪 Model science: diffusion LLMs, shared subspaces, verifier‑free RL
LLaDA 2.0 pushes diffusion LLMs to 100B parameters with big speed gains
DoGe “Decouple to Generalize” boosts VLM reasoning with context‑first RL
RARO shows verifier‑free RL can beat SFT on hard reasoning tasks
Universal Weight Subspace paper finds shared low‑dimensional directions across 1,100 nets
🤖 Embodied data engines and 3D reconstruction
X-Humanoid turns human videos into Tesla-style humanoid training data
Selfi uses VGGT features and splats for pose-free 3D reconstruction
🎬 Creative stacks: Runway 4.5, Nano Banana, Gemini Flash, VFX
Nano Banana and Gemini 3 form a UI design stack in SuperDesign
Nano Banana Pro adds BANANA INPAINT for mask-based scene edits
Runway Gen-4.5 frontier video now available to all paid users
Chronological Mirror uses Nano Banana Pro plus Veo for aging sequences
Freepik Spaces, Kling and Topaz chain a single image into a 1-minute music video
Gemini 3 Flash shows off 3k-line UI dumps and zero-shot animated SVGs
Kling 2.6 is being used across full VFX pipelines, not just text-to-video
Higgsfield Shots one-clicks nine different camera angles per scene
VEED Fabric 1.0 video model lands as an API on Replicate
💼 Enterprise adoption and open‑source signals
Enterprises report real AI productivity gains in HR, coding and operations
VC data shows AI startups dominating application revenue while incumbents hold infra
3,000 Reachy Mini humanoid robots ship worldwide as open robotics platform
Chorus AI chat app open sources and moves to pay‑as‑you‑go APIs
CopilotKit hits #1 on GitHub trending as devs adopt in‑app agent framework
New open‑source SonosCLI offers fast Go-based control for Sonos fleets