
Gemini 3 Pro beats GPT‑5.2 on CAIS – 57.1 vision index
Executive Summary
After last week’s GPT‑5.2 hype cycle, today’s third‑party evals paint a more nuanced frontier picture. On CAIS’s new Text Capabilities Index, Gemini 3 Pro edges out GPT‑5.2 with a 47.6 average vs 45.9, and a big lead in expert‑level reasoning (38.3 vs 29.9) plus Terminal‑Bench coding (53.4 vs 47.7). GPT‑5.2 still owns ARC‑AGI‑2 abstract reasoning at 43.3 vs Gemini’s 31.1, so the “hard math and puzzles → OpenAI” routing rule holds.
CAIS’s Vision Index tilts even harder toward Google: Gemini 3 Pro clocks a 57.1 average vs GPT‑5.2’s 52.6, with standout scores on embodied ERQA (70.2 vs 60.7) and MindCube spatial navigation (77.3 vs 61.7). OCR‑Arena tells a similar story for documents: GPT‑5.2 Medium lands #4 with Elo 1648 and 34.3s latency per page, behind Gemini 3 Preview, Gemini 2.5 Pro, and Opus 4.5 Medium.
WeirdML adds cost texture: GPT‑5.2‑xhigh nudges ahead in accuracy (~0.722 vs Gemini 3 Pro’s ~0.699) but burns around $2.05 per run against $0.526 for Gemini. Meanwhile, builders are openly dunking on SimpleBench after it ranks GPT‑5.2 below older models, treating it as noisy trivia rather than a routing oracle. Net: use GPT‑5.2 for peak abstract reasoning, but Gemini 3 Pro looks like the current default for multimodal coding, agents, and doc‑heavy workloads.
Top links today
- FACTS leaderboard benchmark for LLM factuality
- Confucius Code Agent open source coding agent
- Universal Weight Subspace Hypothesis paper
- LLM scientific review indirect prompt injection
- VEIL jailbreak attacks on text-to-video models
- Empirical study of human LLM coding collaboration
- CNFinBench benchmark for LLM finance safety
- Verifier free LLM reasoning via demonstrations
- VIGIL runtime for self healing LLM agents
- X-Humanoid human video to humanoid robot
- DoGe decouple to generalize vision language
- Robustness of LLM based trajectory prediction
Feature Spotlight
Feature: Gemini gets deeper context and better UX
Gemini tightens end‑to‑end workflow: attach NotebookLM notebooks as context, mark up images to guide edits, smoother Live voice (mute, fewer cut‑offs), and speech‑to‑speech Translate—making Gemini more useful for real work.
🧠 Feature: Gemini gets deeper context and better UX
Cross-account updates concentrated on Google’s Gemini app: NotebookLM notebooks as chat context, image markup to steer edits, Live voice interruptions fixed with mute, speech‑to‑speech in Translate, and signs of a fresh 3 Pro checkpoint.
Gemini app prepares NotebookLM notebook context as an attachment source
Gemini is starting to expose NotebookLM notebooks as a first‑class context source inside the app, so chats can be grounded in an entire research notebook instead of ad‑hoc uploads. A leaked attachment menu shows a new “NotebookLM” option alongside files, Drive, Photos, and code, while testers report being able to "attach notebooks as a context" to Gemini conversations for richer answers and follow‑ups integration teaser attachment menu.

For engineers and analysts, this points to a tighter fusion between Google’s long‑form notebook product and its chat UX: NotebookLM effectively becomes a persistent retrieval corpus you can attach in a click, rather than rebuilding context every session. If this ships broadly, expect workflows where a team maintains a living NotebookLM document and then hands it to Gemini as the single source of truth for deep research, doc QA, and agentic tasks, instead of juggling PDFs and ad‑hoc links feature article.
Builders spot an “af97” Gemini 3 Pro checkpoint under A/B test
Developers are seeing mysterious “af97” variants of Gemini 3 Pro in Google AI Studio and device demos, suggesting Google is quietly A/B‑testing a new checkpoint. One thread shows a side‑by‑side 3D iPhone render where "af97" and standard Gemini 3 Pro produce slightly different geometry, while the UI watermark exposes an internal id like af97bf71081f445a on the model card af97 vs 3 pro demo af97 model id.

For AI engineers and analysts, this is a useful tell: Gemini’s frontier models may change behavior under the hood even when the public name stays the same. That matters for eval reproducibility, long‑running agents, and safety reviews—especially if outputs on tasks like 3D modeling or UI code drift between sub‑checkpoints. Until Google documents these variants, treat Gemini 3 Pro as a moving target and keep your own regression tests and golden outputs around whenever you lock in a workflow on it.
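A cheap way to act on that advice: pin a few representative prompts and compare outputs against stored goldens whenever you suspect a checkpoint swap. The sketch below is generic (not Google tooling); `fake_model` stands in for your real client call, and exact hashing only makes sense at deterministic settings, so swap in a scorer or judge model if you sample with temperature.

```python
# Minimal golden-output regression check for silent checkpoint drift.
import hashlib
import json
from pathlib import Path
from typing import Callable

GOLDEN_FILE = Path("golden_outputs.json")

def fingerprint(text: str) -> str:
    """Stable hash of a whitespace-normalized response, cheap to store and diff."""
    return hashlib.sha256(" ".join(text.split()).encode()).hexdigest()

def record_goldens(prompts: list[str], generate: Callable[[str], str]) -> None:
    goldens = {p: fingerprint(generate(p)) for p in prompts}
    GOLDEN_FILE.write_text(json.dumps(goldens, indent=2))

def check_drift(generate: Callable[[str], str]) -> list[str]:
    """Return the prompts whose outputs no longer match the stored goldens."""
    goldens = json.loads(GOLDEN_FILE.read_text())
    return [p for p, h in goldens.items() if fingerprint(generate(p)) != h]

if __name__ == "__main__":
    fake_model = lambda p: f"canned answer for: {p}"  # stand-in for the real client call
    record_goldens(["render an iPhone in three.js", "write a bash quine"], fake_model)
    print("drifted prompts:", check_drift(fake_model))
```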
Gemini app adds freehand image markup to steer visual edits
The Gemini mobile app now lets you tap an image and draw directly on it to tell the model how to edit, instead of explaining everything in text. A new in‑chat hint ("Tap on your image to show Gemini how you'd like to change it") launches a markup screen where you can sketch in red, add text labels, and then send those marks as structured guidance for the edit markup announcement.
For builders, this is a small but important UX shift: it turns image editing prompts into spatially grounded instructions ("change this area"), which should reduce misaligned edits and back‑and‑forth retries, especially on mobile. It also hints at richer multi‑modal prompting primitives—sketch, highlight, text tags—that you may eventually see exposed in the Gemini API or in computer‑use agents doing more precise visual operations.
Gemini Live reduces mid‑sentence cut‑offs and adds a mute toggle
Google is rolling out two quality‑of‑life fixes to Gemini Live: fewer interruptions when you pause mid‑sentence, and a new mic mute button while the model is talking. Product lead Josh Woodward says they've "fixed the bad habit of cutting you off mid‑sentence" (starting on Android, iOS in the new year) and you’ll be able to mute your mic so you don’t accidentally overtalk the assistant product manager note.
This directly tackles a top complaint with real‑time voice agents: conversational turn‑taking that feels jumpy and rude. For teams building on Gemini’s audio stack, it’s also a signal that Google is investing in latency‑sensitive turn detection and UX controls, not just model quality. If Live becomes the default way many users talk to Gemini, these fixes will shape expectations for every other voice agent—smooth timing, explicit control, and fewer "sorry, go ahead" moments.
📊 Frontier evals: GPT‑5.2 vs Gemini 3 Pro across suites
Today’s sample is heavy on third‑party leaderboards and dashboards. Excludes the Gemini app UX feature. Focus on CAIS indices, OCR‑Arena, WeirdML, SimpleBench, and Epoch ECI → METR horizon estimates.
CAIS Text Capabilities Index puts Gemini 3 Pro ahead, GPT‑5.2 best on ARC‑AGI‑2
CAIS’s new Text Capabilities Index table shows Gemini 3 Pro leading GPT‑5.2 on overall capability while GPT‑5.2’s big win is on ARC‑AGI‑2 abstract reasoning. cais text table Gemini 3 Pro scores an average 47.6 vs GPT‑5.2’s 45.9 and Opus 4.5’s 44.4, with stronger marks for expert‑level reasoning (38.3 vs 29.9) and Terminal‑Bench coding (53.4 vs 47.7). GPT‑5.2, however, tops ARC‑AGI‑2 with 43.3 vs Gemini 3 Pro at 31.1 and Opus 4.5 at 30.6, and roughly matches them on SWE‑Bench (71.8 vs 74.2 and 74.4). For engineers, this reinforces the emerging split: route hard abstract reasoning and some math research toward GPT‑5.2, but consider Gemini 3 Pro or Opus 4.5 where day‑to‑day coding and terminal tasks dominate.
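If you want to encode that split somewhere more durable than a tweet, a routing table keyed on task type is enough to start. The labels and model IDs below are illustrative, not official API names, and the hard part in practice is classifying the incoming task.

```python
# Toy task router based on the CAIS split described above; labels and model
# names are illustrative, not an official routing policy or real model IDs.
ROUTES = {
    "abstract_reasoning": "gpt-5.2",     # ARC-AGI-2-style puzzles, hard math
    "coding": "gemini-3-pro",            # SWE-Bench / Terminal-Bench-style work
    "terminal": "gemini-3-pro",
    "default": "gemini-3-pro",
}

def route(task_type: str) -> str:
    return ROUTES.get(task_type, ROUTES["default"])

assert route("abstract_reasoning") == "gpt-5.2"
assert route("coding") == "gemini-3-pro"
```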
CAIS Vision Index gives Gemini 3 Pro a clear edge over GPT‑5.2
On CAIS’s Vision Capabilities Index, Gemini 3 Pro posts a noticeably higher average score than GPT‑5.2 across embodied and spatial reasoning tasks. cais vision chart Gemini 3 Pro averages 57.1 vs GPT‑5.2’s 52.6, with particularly strong leads on ERQA embodied questions (70.2 vs 60.7) and MindCube spatial navigation (77.3 vs 61.7), while IntPhys 2 intuitive physics remains close (56.9 vs 60.5). GPT‑5.2 keeps things competitive on SpatialViz (65.8 vs 63.2), but trails badly on the EnigmaEval puzzles (14.5 vs 17.8), hinting that Gemini 3 Pro is currently the safer default for multimodal agents that need grounded visual reasoning rather than pure text.
WeirdML shows GPT‑5.2‑xhigh slightly ahead of Gemini 3 Pro but at 4× the cost
The new WeirdML benchmark suite, spanning 17 synthetic reasoning and program‑synthesis tasks, puts GPT‑5.2‑xhigh narrowly in front of Gemini 3 Pro while highlighting a steep cost gap. weirdml summary GPT‑5.2‑xhigh reaches ~0.722 average accuracy versus Gemini 3 Pro (high) at ~0.699 and Opus 4.5 (high, 16k) at ~0.637, but costs about $2.05 per run compared with $0.526 for Gemini 3 Pro and $0.738 for Opus 4.5, and tends to emit longer solutions (394 lines of code vs 230 and 173). The “best per‑task model” upper bound across all systems sits much higher at ~0.892, reminding people that no single model dominates every WeirdML task. This makes WeirdML useful for routing experiments: you can pay up for GPT‑5.2‑xhigh when marginal accuracy matters, or favor Gemini 3 Pro or Opus 4.5 when you want cheaper, shorter code.
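The cost gap is easier to reason about as accuracy per dollar; the snippet below just recomputes the WeirdML figures quoted above.

```python
# Back-of-envelope accuracy-per-dollar from the WeirdML numbers quoted above.
runs = {
    "gpt-5.2-xhigh": {"accuracy": 0.722, "cost_usd": 2.05},
    "gemini-3-pro":  {"accuracy": 0.699, "cost_usd": 0.526},
    "opus-4.5-high": {"accuracy": 0.637, "cost_usd": 0.738},
}
for name, r in sorted(runs.items(), key=lambda kv: kv[1]["accuracy"], reverse=True):
    print(f"{name:15s} acc={r['accuracy']:.3f} acc/$={r['accuracy'] / r['cost_usd']:.2f}")
# GPT-5.2-xhigh lands around 0.35 accuracy per dollar vs ~1.33 for Gemini 3 Pro,
# which is the cost gap the routing argument above is about.
```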
Builders start dismissing SimpleBench after GPT‑5.2 and Opus rankings
Following up on SimpleBench score, where GPT‑5.2 landed 17th overall and Gemini 3 Pro topped the board, more builders are now openly questioning whether SimpleBench’s trick‑question format measures anything that matters. New screenshots still show Gemini 3 Pro at 76.4%, Opus 4.5 at 62.0%, GPT‑5 Pro at 61.6%, and GPT‑5.2 Pro and base at 57.4% and 45.8% respectively. simplebench table The reaction has shifted from surprise to rejection: one engineer calls SimpleBench “dead to me” because it ranks Opus 4.5 below Gemini 2.5 while they care more about “always‑on superhuman abstract reasoning” than gotcha common sense puzzles. builder criticism For teams, the takeaway is to treat SimpleBench as one noisy signal among many, not as a routing oracle for production traffic.
OCR‑Arena ranks GPT‑5.2 #4 behind Gemini and Opus on document reading
On the OCR‑Arena leaderboard, GPT‑5.2 lands solidly in the top tier for document understanding but still trails Gemini 3 Preview and Claude Opus 4.5 on Elo and win rate. ocr-arena chart The GPT‑5.2 Medium setting is currently ranked #4 with Elo 1648, a 69.2% win rate over 117 battles, and ~34.3 seconds latency per page, while a lower‑effort GPT‑5.2 configuration sits at #6 with Elo 1575, 63.1% win rate, and ~15.6 seconds latency. Above it are Gemini 3 Preview (Elo 1698, 72.0% win rate), Gemini 2.5 Pro, and Opus 4.5 Medium. For anyone building OCR‑heavy pipelines (claims, contracts, invoices), this suggests GPT‑5.2 is competitive but not yet the obvious best choice, and you’ll want to profile it head‑to‑head with Gemini and Opus using your own documents.
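For intuition on what a 50‑point Elo gap means, the standard expected-score formula (assuming OCR‑Arena uses the usual 400‑point logistic scale, which is our assumption) puts Gemini 3 Preview at roughly a 0.57 expected score against GPT‑5.2 Medium:

```python
# Standard Elo expected-score formula applied to the quoted OCR-Arena ratings.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

gemini_3_preview, gpt_52_medium = 1698, 1648
print(f"expected score, Gemini 3 Preview vs GPT-5.2 Medium: "
      f"{expected_score(gemini_3_preview, gpt_52_medium):.2f}")  # ≈ 0.57
```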
🛠️ Agent stacks and developer tooling updates
Multiple practical build threads and OSS/tooling drops. Excludes Gemini UX feature. Emphasis on call agents, LinkedIn research, serverless state, desktop file access, browser automation, and terminal UX.
LangChain community ships open course for real phone-call agents
LangChain’s community released a free Phone Calling Agents Course that walks you through building production-ready Twilio voice agents with real-time conversations, property search and conversation memory on top of LangChain and LangGraph course overview. This matters if you’re trying to move from toy bots to agents that actually answer phones, route calls, and query backends without you reinventing the scaffolding.
The course repo includes an end‑to‑end reference that wires Twilio streaming audio into an AI phone agent, then out to tools for property search and a memory component that keeps context across turns course repo. You can copy the diagrams, CDK-style infra and prompt patterns directly, or fork pieces into your own call-center or sales-assistant stacks instead of starting from a blank Twilio webhook.
Oracle CLI adds Gemini web browsing, images and YouTube tools
The oracle CLI’s browser engine now supports a Gemini web mode, so you can drive Gemini 3 from the command line using your regular Chrome cookies, including --generate-image, --edit-image, and YouTube and file‑attachment flows oracle changelog. This builds on the tool’s earlier GPT‑5.2 Pro support oracle cli, turning it into a true multi‑model browser harness.
In Gemini mode, media passed via --file is uploaded as attachments rather than inlined into prompts, enabling real image, video, audio and PDF analysis from the command line. The release also adds smarter redirect handling for Gemini image ops and an opt‑in smoke test for Gemini web runs, plus keeps the browser guard so only GPT and Gemini models are allowed unless you explicitly flip --engine api release notes. For anyone building browser‑based agents, this gives you a ready‑made driver that can swap between OpenAI and Google stacks without custom Puppeteer code.
PeopleHub turns LinkedIn due‑diligence into an AI agent workflow
PeopleHub is a LangChain community project that wraps LinkedIn research into an AI agent: you describe what you want in natural language and it runs a LangGraph workflow over profiles, activity and reports, caching results to cut costs by an advertised 70–90% peoplehub summary. If you do any founder, candidate, or investor research, this is exactly the kind of repeated due‑diligence flow that shouldn’t live in ad‑hoc prompts anymore.
Under the hood, the agent pulls profile data, recent activity and a synthesized report through a central LangGraph orchestrator, then stores everything in a cache layer so subsequent runs hit storage instead of the LLM where possible github repo. For AI engineers, it’s a concrete pattern for multi‑tool research agents (retrieval, summarization, caching) that you can adapt from LinkedIn to any structured directory or CRM.
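The cache-then-LLM part of that pattern is small enough to sketch directly; the function below is illustrative (PeopleHub's actual schema and storage will differ), and `run_agent` is a stand-in for the LangGraph workflow.

```python
# Minimal sketch of the cache-then-LLM pattern: serve a stored report if one
# exists, otherwise run the (expensive) agent workflow and persist the result.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".research_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_research(query: str, run_agent) -> dict:
    key = hashlib.sha256(query.lower().strip().encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())   # cache hit: no LLM spend
    report = run_agent(query)                 # cache miss: run the full workflow
    path.write_text(json.dumps(report))
    return report

report = cached_research("profile of Jane Doe, Acme CTO",
                         run_agent=lambda q: {"query": q, "summary": "stub report"})
```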
Tutorial shows how to run stateful LangGraph agents on AWS Serverless
Thomas Taylor published a walkthrough on deploying stateful LangChain/LangGraph agents onto AWS Lambda, using DynamoDB as a checkpointer and CDK to stand up the whole stack aws serverless diagram. If you’ve been stuck keeping agents on a long‑running box because of state, this shows a path to serverless without giving up memory.
The architecture runs the agent logic inside Lambda, persists graph state and memory snapshots into DynamoDB via LangGraph’s checkpointer, and streams tokens back to the client so you still get low‑latency UX video tutorial. CDK code in the repo wires IAM, Lambda, API Gateway, and Dynamo tables, so you can clone the pattern and swap in your own agent logic instead of wrestling with serverless plumbing yourself.
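A minimal sketch of that handler pattern, assuming LangGraph's standard checkpointer interface: `MemorySaver` is used only so the snippet runs standalone, and in the tutorial's setup you would swap in a DynamoDB-backed saver while keeping the same `thread_id` plumbing.

```python
# Lambda-side pattern: compile the graph once per container, key state to a
# thread_id, and let a checkpointer persist it between invocations.
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.checkpoint.memory import MemorySaver

def agent_node(state: MessagesState) -> dict:
    # Placeholder for the real LLM call / tool loop.
    return {"messages": [("ai", f"echo: {state['messages'][-1].content}")]}

builder = StateGraph(MessagesState)
builder.add_node("agent", agent_node)
builder.add_edge(START, "agent")
builder.add_edge("agent", END)
graph = builder.compile(checkpointer=MemorySaver())  # swap for a DynamoDB saver

def handler(event, context=None):
    """Lambda entry point: one conversation per thread_id."""
    config = {"configurable": {"thread_id": event["conversation_id"]}}
    result = graph.invoke({"messages": [("user", event["text"])]}, config)
    return {"reply": result["messages"][-1].content}

print(handler({"conversation_id": "demo", "text": "hello"}))
```

The key design choice is that the Lambda itself stays stateless; all conversational memory lives behind the checkpointer, keyed by thread ID.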
Clawdis grows into a distributed multi-device agent canvas
Clawdis, Peter Steinberger’s personal agent runtime around Claude, now runs as a distributed network across Mac and iOS nodes, so a single agent can listen on one device and use another as a shared screen or control surface instances view. This turns idle tablets, laptops and monitors into extra canvases or microphones for your agents instead of letting them sit dark.
The latest screenshots show a gateway Mac discovering remote nodes, per‑node presence beacons, and options like Allow Canvas, Browser Control, and Voice Wake toggles exposed in a unified menu critter selection. The upshot for builders is you can start treating your home or office devices as one big, agent‑addressable surface—useful for UI automation experiments, wallboard‑style dashboards, or voice agents that follow you between rooms.
CodexBar adds Gemini support and better model usage telemetry
CodexBar, the macOS menu‑bar companion for coding agents, shipped a small but useful update: a sharper UI, more reliable installs (including Claude Code Enterprise), and support for tracking Gemini usage alongside OpenAI and Anthropic models codexbar update. If you’re juggling several agent CLIs, this keeps a live view of rate‑limits and reset windows one click away.
The new build fixes glitches around mise installs and adds a direct link to an API billing dashboard, so you can catch runaway background jobs or misconfigured agents before they torch your budget project site. It’s a classic “glue” tool, but for people running Codex, Claude Code, Gemini CLI and similar on the same machine all day, having a single usage HUD removes a lot of guesswork.
Warp terminal rolls out slash commands for agent and MCP workflows
Warp added a slate of slash commands like /usage, /conversations, /view-mcp, /add-rule, and /add-prompt to manage AI usage, MCP tools and prompts directly from the terminal slash command demo. It turns the shell itself into a lightweight control panel for your agent stack instead of pushing everything into a browser.

Following up on Warp’s earlier GPT‑5.2 integration and Terminal‑Bench work warp bench, these commands now let you check model usage, tweak MCP routing rules, and register shared prompts without leaving the CLI. That’s handy if your team is experimenting with multiple LLM providers or MCP tools and you want a faster feedback loop than digging through config files or dashboards.
SonosCLI gives agents and scripts first‑class control over Sonos speakers
Steipete released sonoscli, a pure‑Go command‑line tool that discovers and controls Sonos speakers over UPnP/SOAP, with support for grouping, queues, favorites, scenes and Spotify playback sonoscli feature list. It’s meant for humans, but it also gives coding agents a clean, scriptable surface for multi‑room audio.
The tool handles coordinator‑aware control (you point at a room; it targets the group coordinator), can search and enqueue Spotify links, and exposes JSON/TSV output for scripting sonoscli site. In practice, people are already letting agents like Codex debug the CLI and “DJ” by hopping between playlists and rooms codex sonos usage. If you’re experimenting with home‑automation agents, this is an easy way to bridge from LLM plans to real speakers without writing Sonos SOAP by hand.
🧩 Agent reliability and GraphRAG advances
Mostly research‑grade proposals targeting long‑running agent reliability and retrieval alignment. New today: run‑time self‑healing, distilled procedural memory, and LLM‑aligned graph retrievers.
Meta open-sources Confucius Code Agent with 54.3% on SWE‑Bench‑Pro
Meta introduced Confucius Code Agent (CCA), an open‑source coding agent stack built for massive real‑world repos, hitting 54.3% Resolve@1 on SWE‑Bench‑Pro—state of the art for agents that patch real projects and must pass tests. paper overview
CCA wraps an LLM in a unified orchestrator with hierarchical working memory (short‑term, long‑term, cross‑session notes) plus modular tool extensions, so the agent can keep context across long debugging sessions instead of thrashing on repeated errors. A persistent note‑taking subsystem stores distilled lessons (including failure notes) between sessions, while a meta‑agent runs a build‑test‑improve loop that auto‑tunes prompts, tools, and configuration for new environments. The authors frame this as an agent SDK as much as a single agent: they separate Agent Experience (how the LLM “feels”), User Experience, and Developer Experience, aiming to close the gap between lab prototypes and production‑grade software engineers. Code and SDK are on GitHub, making this one of the first open stacks that seriously tackles industrial‑scale agent reliability rather than just benchmark scores. GitHub repo
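As a flavor of the note-taking idea (an illustrative sketch, not Meta's SDK), persistent lessons can be as simple as an append-only file whose recent entries get prepended to the next session's context:

```python
# Persistent cross-session notes: distilled lessons, including failure notes,
# survive between runs and are injected into the next session's prompt.
import json
from pathlib import Path

NOTES = Path("agent_notes.jsonl")

def add_note(kind: str, lesson: str) -> None:
    with NOTES.open("a") as f:
        f.write(json.dumps({"kind": kind, "lesson": lesson}) + "\n")

def notes_for_prompt(max_notes: int = 20) -> str:
    if not NOTES.exists():
        return ""
    entries = [json.loads(line) for line in NOTES.read_text().splitlines()][-max_notes:]
    return "\n".join(f"[{n['kind']}] {n['lesson']}" for n in entries)

add_note("failure", "pytest -q hides the traceback; rerun with -x --tb=long")
add_note("success", "repo uses hatch, not pip, for editable installs")
print(notes_for_prompt())
```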
ReG (Weak‑to‑Strong GraphRAG) aligns graph retrievers to LLM reasoning
A new framework called ReG (Refined Graph‑based RAG) tackles a core mismatch in GraphRAG systems: retrievers are trained on weak supervision (shortest paths, noisy labels) while LLMs need clean reasoning chains. paper thread
ReG has an LLM pick effective reasoning chains from candidate graph paths, then uses those as higher‑quality supervision to retrain the graph retriever, instead of trusting raw shortest‑path labels. A structure‑aware re‑organizer then turns raw retrieved subgraphs into linear, logically coherent evidence chains that better match how LLMs consume context. On CWQ‑Sub with GPT‑4o, ReG hits 68.91% Macro‑F1 vs 66.48% for SubgraphRAG, and 80.08% vs 79.4% on WebQSP‑Sub, with similar gains across other backbones. The striking part: with only 5% of the training data, ReG can match baselines trained on 80%, while also cutting reasoning tokens by up to 30% when paired with "thinking" models like QwQ‑32B—pointing to more reliable, cheaper multi‑hop GraphRAG without a full retriever redesign. ArXiv paper
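The weak-to-strong loop itself is simple to state in code; the sketch below is schematic rather than the authors' implementation, with `llm_selects` standing in for the LLM that filters candidate chains.

```python
# Schematic weak-to-strong supervision: an LLM filters candidate graph paths
# into "effective" chains, which then replace noisy shortest-path labels when
# retraining the retriever.
from typing import Callable

def refine_supervision(
    questions: list[str],
    candidate_paths: dict[str, list[list[str]]],   # question -> candidate chains
    llm_selects: Callable[[str, list[list[str]]], list[list[str]]],
) -> dict[str, list[list[str]]]:
    """Return LLM-approved chains to use as retriever training labels."""
    refined = {}
    for q in questions:
        chosen = llm_selects(q, candidate_paths[q])
        if chosen:   # skip questions where the LLM rejects every candidate chain
            refined[q] = chosen
    return refined

# toy run with a stand-in selector that keeps chains using the right relation
paths = {"who founded X?": [["X", "founded_by", "Alice"], ["X", "located_in", "Y"]]}
print(refine_supervision(list(paths), paths,
                         lambda q, chains: [c for c in chains if "founded_by" in c]))
```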
ReMe turns past tool runs into compact procedural memory for agents
The ReMe framework (Remember Me, Refine Me) proposes a dynamic procedural memory layer that lets agents learn “what to do in situation X” from past runs, instead of re‑reading huge chat logs. paper thread
ReMe distills short, situation‑specific rules by contrasting successful and failed trajectories, then filters, validates, and de‑duplicates them before storing them as memories keyed by a “when to use” description. At inference time, the agent retrieves only the few relevant rules, rewrites them into task‑ready guidance, and prunes memories that are often retrieved but rarely helpful—keeping the memory pool fresh instead of bloated. On two benchmarks, Qwen3‑8B with ReMe beats a memoryless Qwen3‑14B by an average +8.83 percentage points success over four tries, despite being a much smaller base model. The result is a training‑free way to make long‑running tool‑using agents improve with experience, without drowning their context windows in noisy replayed conversations. GitHub repo
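Conceptually the retrieval side looks like this (an illustrative toy, not the ReMe code; the real system would key memories with embeddings rather than string similarity):

```python
# Situation-keyed procedural memory: each rule stores a "when to use"
# description, and retrieval injects only the top few matching rules.
from difflib import SequenceMatcher

MEMORIES = [
    {"when": "API returns HTTP 429", "rule": "back off exponentially, cap at 60s"},
    {"when": "CSV has mixed encodings", "rule": "try utf-8 then latin-1, log which"},
    {"when": "search tool returns nothing", "rule": "broaden query, drop quotes"},
]

def relevant_rules(situation: str, k: int = 2) -> list[str]:
    scored = sorted(
        MEMORIES,
        key=lambda m: SequenceMatcher(None, situation.lower(), m["when"].lower()).ratio(),
        reverse=True,
    )
    return [m["rule"] for m in scored[:k]]

print(relevant_rules("the weather API replied with HTTP 429 Too Many Requests"))
```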
VIGIL introduces a self‑healing runtime that patches failing agents in flight
The VIGIL runtime (Verifiable Inspection and Guarded Iterative Learning) aims to keep long‑running LLM agents from silently looping or failing by watching their logs and injecting small, targeted fixes while they run. paper thread
VIGIL reads an agent’s event log—tool calls, results, errors—and scores each event into an "emotion" signal (e.g., success, frustration) that feeds into an Emotional Bank with decay, so recent pain matters more than ancient history. From this it builds a Roses/Buds/Thorns diagnosis (what’s working, what might help, what’s hurting) and then proposes two guarded patches: a prompt tweak constrained to a specific editable block, and a read‑only code diff suggestion. A simple stage‑gated pipeline (start → emotional bank updated → diagnosed → prompt patched → diff proposed) ensures changes happen in a controlled order; when VIGIL’s own diagnosis tool crashed in tests, it surfaced the exact error and still produced a fallback plan, rather than hiding the failure. In a reminder‑agent demo, this cut false "success" pop‑ups from 100% to 0% and reduced average lag from 97 seconds to 8 seconds, showing that a reflective runtime can materially boost agent reliability without retraining the base model. ArXiv paper
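The decayed "emotion" signal is the easiest piece to picture in code; the event scores and decay factor below are made up for illustration, not taken from the paper.

```python
# Decayed "emotional bank": each event contributes a signed score, older events
# are down-weighted, and the running balance feeds the diagnosis step.
from dataclasses import dataclass, field

EVENT_SCORES = {"tool_success": +1.0, "tool_error": -1.0, "retry": -0.5}

@dataclass
class EmotionalBank:
    decay: float = 0.9          # per-event decay: recent pain matters more
    balance: float = 0.0
    history: list = field(default_factory=list)

    def record(self, event: str) -> float:
        self.balance = self.decay * self.balance + EVENT_SCORES.get(event, 0.0)
        self.history.append((event, self.balance))
        return self.balance

bank = EmotionalBank()
for e in ["tool_success", "tool_error", "tool_error", "retry"]:
    bank.record(e)
print(f"balance={bank.balance:.2f}")  # a negative balance would trigger a diagnosis pass
```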
🛡️ Safety, robustness and compliance signals
Concentrated on new misuse/compliance findings. Separate from the feature and from bench/econ. Includes T2V jailbreaks, finance compliance stress tests, PDF hidden‑prompt attacks, and OpenAI research controversy.
CNFinBench reveals LLM finance skills far outpacing rule‑following and risk controls
The CNFinBench benchmark shows that large language models can look competent at financial tasks while being much weaker at compliance and risk control, with an average 61.0 score on capability tasks but only 34.18 on compliance and safety evaluations across 21 models cnfinbench summary. It stress‑tests systems with realistic, often multi‑turn finance scenarios (e.g., rule‑bound advice, investor‑protection norms) and introduces a Harmful Instruction Compliance Score (HICS) that measures how consistently models resist jailbreak‑style prompts over full dialogues rather than single turns cnfinbench summary.
For AI teams in fintech or banking, this is a clear signal that "it can pass quant quizzes" is not enough; you need dedicated policy and safety evals that reflect your regulators and product surface. CNFinBench also uses strict output formatting and perturbed choice options so you can’t quietly game the benchmark with fragile prompt engineering, and it relies on an ensemble of judge models plus human calibration to score open‑ended answers. That makes it a useful template if you’re designing internal red‑team suites or vendor evaluations for finance‑facing LLMs.
Google’s FACTS Leaderboard shows top models only ~69% factual in practice
Google introduced the FACTS Leaderboard, a comprehensive factuality benchmark where even the best current models only reach about 69% overall factual accuracy across diverse real‑world scenarios facts benchmark summary. FACTS blends four sub‑leaderboards—multimodal image Q&A, parametric (closed‑book) fact recall, search‑augmented answering, and document‑grounded long‑form responses—scored primarily by LLM judges that check for coverage of key facts and penalize contradictions, hallucinated details, and evasive non‑answers facts benchmark summary.
This matters if you’re treating any general‑purpose LLM as a trusted explainer or research assistant. The suite includes both public and private splits to limit overfitting, and its grounding variant explicitly punishes responses that dodge the question by being too vague, which makes it harder to win by answering "safely" rather than correctly. For teams building safety layers or routing logic, FACTS provides a shared yardstick to compare factuality across models and modes (text‑only, vision, with or without tools), instead of relying on ad‑hoc hallucination anecdotes.
Hidden PDF prompts can flip LLM peer reviews from reject to accept
The paper "When Reject Turns into Accept" shows that hiding tiny, white‑on‑white instructions inside PDFs can systematically bias LLM‑based scientific reviewers, inflating scores by up to ~14 points on a 35‑point rubric and flipping reject decisions to accept pdf attack summary. The authors test 15 attack styles across 13 models on 200 papers and introduce a Weighted Adversarial Vulnerability Score (WAVS) that combines score lift, decision flips, and whether the attacked file looks like a real submission or an empty template pdf attack summary.
The core issue is that many "LLM as reviewer" or "LLM as judge" systems naively run OCR/HTML conversions and feed everything—including invisible text—into the model. Strong attacks don’t argue about the science; they frame the task as schema checking or logging where the "correct" JSON happens to encode a perfect score, or they obfuscate instructions so the model reconstructs them during reading rather than seeing an obvious prompt injection. If you’re building eval pipelines, automated peer review, or safety judges, you need robust preprocessing (e.g., stripping invisible text, normalizing fonts), plus adversarial testing that goes beyond obvious "IGNORE PREVIOUS INSTRUCTIONS" strings.
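One concrete preprocessing step is dropping characters whose fill color is near-white before the text reaches the reviewer model. The sketch below uses pdfplumber's character-level color attributes; treat the threshold and color handling as a starting point, since real attacks can also hide text off-page, in tiny fonts, or inside images.

```python
# Drop near-white characters from extracted PDF text before feeding a reviewer
# model. Color encodings vary by PDF, so this is a heuristic, not a full defense.
import pdfplumber

def is_near_white(color, threshold: float = 0.95) -> bool:
    if color is None:
        return False
    values = color if isinstance(color, (list, tuple)) else [color]
    return all(isinstance(v, (int, float)) and v >= threshold for v in values)

def visible_text(pdf_path: str) -> str:
    out = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            kept = [c["text"] for c in page.chars
                    if not is_near_white(c.get("non_stroking_color"))]
            out.append("".join(kept))
    return "\n".join(out)

# visible_text("submission.pdf") would then feed the reviewing model
```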
OpenAI accused of soft‑pedaling AI job‑loss research as staffer quits
Wired reports that a senior OpenAI economist, Tom Cunningham, left after concluding the company had become hesitant to publish research on negative economic impacts like job displacement, favoring more positive, advocacy‑aligned work instead wired article preview. Anonymous outside collaborators say recent output increasingly "casts its technology in a favorable light" compared to earlier work such as the widely cited "GPTs Are GPTs" paper on automation exposure wired article.
OpenAI denies suppressing research, arguing it has expanded the economics team’s scope and hired a chief economist, but the episode feeds growing concerns about how much the public can rely on lab‑funded work to honestly quantify AI‑driven job loss wired job-loss thread. For policymakers, unions, and risk groups, the signal is that independent measurement—by academics, central banks, or regulators—will remain crucial, because even safety‑conscious labs still face heavy incentives to downplay near‑term displacement when negotiating policy and public sentiment.
VEIL jailbreak exposes audio‑style backdoor in text‑to‑video safety filters
A new attack called VEIL shows that safe‑sounding prompts plus subtle audio and style cues can bypass text‑to‑video safety filters, boosting jailbreak success by about 23% on commercial models veil paper summary. It composes prompts from three benign‑looking pieces (scene anchor, auditory trigger, stylistic modulator), then uses an LLM to iteratively search for combinations that look harmless to text filters but still elicit prohibited visuals from the video model veil paper summary.
For people shipping or evaluating T2V systems, the point is: keyword‑based filters and simple prompt scrubbers are not enough once models learn strong correlations between sounds, cinematography styles, and visual content. VEIL stays entirely in the "implicit" channel, so text moderation gives a green light even as the underlying model renders content that would normally be banned. This pushes safety work toward multimodal classifiers and behavior‑level filters (on generated video), and it’s a reminder to treat audio and style metadata as potential attack vectors, not just user creativity knobs.
Small adversarial tweaks badly mislead LLM‑based vehicle trajectory prediction
A robustness study on LLM‑enabled vehicle trajectory prediction finds that changing just one kinematic feature of a single nearby vehicle—kept within physically plausible bounds—can increase location error by 29% and cut lane‑change F1 by 12% for a text‑based driving model trajectory paper summary. The attack uses a black‑box differential evolution search over prompt encodings of the scene, so the adversary never needs model weights, and it shows larger, more accurate models in clean conditions can actually be more fragile under this stress trajectory paper summary.
For anyone experimenting with LLMs as planners or high‑level traffic predictors, the takeaway is that "it works on the dataset" doesn’t say much about resilience to slightly corrupted sensor or V2X data. The authors also note that chain‑of‑thought style reasoning traces provide some robustness, because they force the model to articulate cues instead of over‑relying on exact numeric values. Still, the work argues that safety‑critical modules need strong input validation and traditional control‑theory checks around LLM components, not blind trust in their pattern‑matching prowess.
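The attack shape is reproducible with off-the-shelf optimizers; the toy below uses SciPy's differential evolution over a single bounded feature, with a stand-in objective where the paper's setup would query the LLM-based predictor.

```python
# Toy reproduction of the attack shape (not the paper's code): perturb one
# kinematic feature of one neighbor within plausible bounds and let
# differential evolution maximize the victim model's prediction error.
import numpy as np
from scipy.optimize import differential_evolution

def prediction_error(neighbor_speed_mps: float) -> float:
    # Stand-in victim: a real attack would re-encode the prompt with this speed
    # and measure the LLM predictor's displacement error. Negated: DE minimizes.
    return -abs(np.sin(neighbor_speed_mps / 3.0))

bounds = [(0.0, 35.0)]   # physically plausible speed range for the perturbed vehicle
result = differential_evolution(lambda x: prediction_error(float(x[0])),
                                bounds, maxiter=20, seed=0)
print(f"worst-case speed ≈ {result.x[0]:.1f} m/s, error proxy {-result.fun:.2f}")
```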
🧪 Model science: diffusion LLMs, shared subspaces, verifier‑free RL
Today’s papers propose alternative decoding/training paths and generalization recipes; not product launches. Distinct from agent reliability and safety categories.
LLaDA 2.0 pushes diffusion LLMs to 100B parameters with big speed gains
LLaDA 2.0 is pitched as the first diffusion-style language model scaled to 100B parameters, matching frontier autoregressive LLM quality while running about 2.3× faster on average via parallel decoding, with especially strong throughput on coding tasks. diffusion llm thread The team says the hard part is not the core algorithm but the systems work, so they’re open‑sourcing their training and inference stacks (dFactory and dInfer) to tackle infra bottlenecks as well.
The Zhihu analysis notes several open research fronts: the train–inference mismatch in masked discrete diffusion (later tokens become hard to decode and degenerate toward AR), RL for diffusion models where Monte Carlo + ELBO style objectives explode compute (e.g. 16k → 64k rollouts), and the possibility of moving from mask‑to‑token to true token‑to‑token editing, which would change how we do chain‑of‑thought by letting models revise tokens instead of only appending. diffusion llm thread For engineers, the message is that diffusion LLMs are no longer just a curiosity—there’s now at least one 100B‑scale implementation with a public systems stack, so it’s time to start thinking about how your serving, routing, and RL pipelines would need to adapt to non‑AR decoders.
DoGe “Decouple to Generalize” boosts VLM reasoning with context‑first RL
DoGe (Decouple to Generalize) is a vision‑language RL scheme that separates a trainable Thinker (context analysis) from a mostly frozen Solver (answering), and uses the Solver’s accuracy given the Thinker’s analysis as a reward—yielding +5.7 percentage points on a 3B model and +2.3 points on a 7B model across seven VLM benchmarks in data‑scarce domains. doge paper thread
The authors argue that standard multimodal RL overfits narrow synthetic prompts, collapses exploration, and encourages reward hacking, especially when domain data (chemistry, earth science, multimodal math) is thin. doge paper thread DoGe counters this by (1) training the Thinker to reason about context snippets before touching the final question, (2) expanding a domain knowledge pool from scraped sources and an evolving seed set so the Thinker sees varied scenarios, and (3) then distilling the improved Thinker back into a full end‑to‑end model with Group Relative Policy Optimization. For practitioners, the takeaway is that you can sometimes get better VLM generalization by explicitly training a context explainer whose outputs feed a fixed solver, instead of trying to learn everything in one monolithic policy.
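The reward decoupling is the part worth internalizing; schematically (not the paper's code, and with a toy solver standing in for the frozen VLM):

```python
# Decoupled reward: the Thinker writes a context analysis, a frozen Solver
# answers given that analysis, and the Thinker is rewarded only on whether the
# Solver got it right.
from typing import Callable

def thinker_reward(
    question: str,
    analysis: str,
    gold_answer: str,
    solver: Callable[[str, str], str],
) -> float:
    prediction = solver(question, analysis)
    return 1.0 if prediction.strip() == gold_answer.strip() else 0.0

# toy solver that only succeeds when the analysis surfaces the key facts
toy_solver = lambda q, a: "2" if "Phobos" in a and "Deimos" in a else "unknown"
print(thinker_reward("How many moons does Mars have?",
                     "the retrieved context names Phobos and Deimos", "2", toy_solver))
```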
RARO shows verifier‑free RL can beat SFT on hard reasoning tasks
RARO (Relativistic Adversarial Reasoning Optimization) proposes a way to train reasoning LLMs purely from expert demonstrations, without task‑specific verifiers, and still get strong gains—for example reaching 54.4% accuracy on the Countdown benchmark versus 40.7% for a supervised‑fine‑tuned baseline. raro paper thread
Instead of scoring single answers as right or wrong, RARO trains a critic that compares an expert solution and a model‑generated solution and chooses expert, model, or tie; the policy then gets rewarded for fooling this critic, with ties stabilizing learning. raro paper thread At inference time, the critic can also rank multiple sampled chains of thought and pick the best, so the same machinery extends naturally to open‑ended tasks where you don’t have a clean verifier. For teams hitting the limits of verifier‑based RL (math, code with tests) and needing better long‑form reasoning on softer domains, this is an early but concrete recipe to study.
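The reward itself is just a preference mapping over the critic's verdict; the 1.0/0.5/0.0 values below are illustrative rather than the paper's exact shaping, and the toy critic stands in for an LLM judge.

```python
# Relativistic reward: a critic compares the expert solution with the policy's
# solution and outputs "expert", "model", or "tie"; the policy is rewarded when
# it isn't told apart from (or is preferred to) the expert.
def policy_reward(verdict: str) -> float:
    return {"model": 1.0, "tie": 0.5, "expert": 0.0}[verdict]

def rollout_reward(problem: str, expert_solution: str, model_solution: str, critic) -> float:
    verdict = critic(problem, expert_solution, model_solution)  # LLM judge in practice
    return policy_reward(verdict)

# toy critic that calls a tie whenever both solutions reach the same final line
toy_critic = lambda p, e, m: "tie" if e.splitlines()[-1] == m.splitlines()[-1] else "expert"
print(rollout_reward("reach 24 from 3,3,8,8",
                     "8/(3-8/3)\n= 24", "8 / (3 - 8/3)\n= 24", toy_critic))
```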
Universal Weight Subspace paper finds shared low‑dimensional directions across 1,100 nets
The Universal Weight Subspace Hypothesis paper claims that a surprisingly small set of shared weight directions—around 16 per layer—captures most of the parameter variation across ~1,100 trained models spanning 500 Mistral‑7B LoRAs, 500 Vision Transformers, and 50 LLaMA‑8B instances. subspace paper thread
By doing mode‑wise spectral analysis on weight matrices, the authors find that diverse tasks and inits still concentrate updates into a common low‑dimensional subspace, suggesting that training mostly moves along a few global basis directions instead of fully exploring the parameter space. subspace paper thread They then show you can reuse this basis for practical things like compressing large model families, merging hundreds of ViT checkpoints into one factored representation, or training new tasks via a handful of coefficients instead of full LoRAs, which could cut storage and fine‑tuning compute for people maintaining big model zoos.
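You can sanity-check the intuition on synthetic data: construct weight updates that really do live in a 16-dimensional subspace plus noise, and confirm an SVD recovers it. The snippet is a toy, not the paper's mode-wise per-layer analysis.

```python
# Toy check of the shared-subspace idea: stack synthetic weight updates, take
# an SVD, and measure how much variance the top-16 directions explain.
import numpy as np

rng = np.random.default_rng(0)
d, n_models, k_shared = 512, 200, 16

basis = rng.standard_normal((d, k_shared))                   # pretend-universal directions
coeffs = rng.standard_normal((k_shared, n_models))
updates = basis @ coeffs + 0.05 * rng.standard_normal((d, n_models))  # plus small residual

_, s, _ = np.linalg.svd(updates, full_matrices=False)
explained = np.cumsum(s**2) / np.sum(s**2)
print(f"top-{k_shared} directions explain {explained[k_shared - 1]:.1%} of variance")
```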
🤖 Embodied data engines and 3D reconstruction
New methods turn human video into robot data and reconstruct scenes without classic SfM. Separate from business shipments. Mostly research artifacts in this sample.
X-Humanoid turns human videos into Tesla-style humanoid training data
A new paper, X‑Humanoid, proposes a video‑to‑video diffusion pipeline that converts ordinary human videos into realistic Tesla Optimus–style humanoid footage, giving robotics teams a way to scale data without filming real robots in every scene. It’s finetuned from Wan2.2 using LoRA on synthetic paired clips where a human and a humanoid perform the same motion in Unreal, then applied to real datasets like Ego‑Exo4D to produce 60 hours (3.6M frames) of Optimus video for training vision‑language‑action and world models paper thread.
The key idea is to preserve scene layout, lighting, and camera motion while swapping only the actor’s embodiment, avoiding earlier hacks that pasted robot arms into egocentric views or broke when the full body was visible paper thread. By doing the swap in a powerful Video DiT and keeping the humanoid consistent across frames, X‑Humanoid creates data that downstream policies can treat as real robot experience, which matters if you’re betting on learning from large human video corpora instead of expensive robot teleop runs (ArXiv paper). For AI engineers working on embodied agents, this is a concrete path to bootstrap humanoid datasets from existing video without a custom mocap or robot studio.
Selfi uses VGGT features and splats for pose-free 3D reconstruction
Selfi is a "self‑improving" 3D reconstruction engine that skips classic Structure‑from‑Motion and instead leans on VGGT depth/pose predictions, geometric feature alignment, and Gaussian splatting to render new views from uncalibrated photo sets paper thread. It starts from VGGT’s per‑image depth and camera guesses, trains a lightweight adapter so corresponding pixels share similar feature codes across views, then turns those features into 3D Gaussians that can be bundle‑adjusted and rendered from novel viewpoints.
Because Selfi refines both camera poses and depth inside this feature‑aligned splat representation, it can improve reconstruction quality and pose accuracy without ground‑truth 3D labels or known cameras, purely by recycling VGGT’s own predictions over time (ArXiv paper). For anyone building embodied simulators, NeRF‑style training loops, or 3D perception for robots, this is a sign that "pose‑free" pipelines—where a strong foundation model plus clever optimization stand in for SfM—are maturing into practical scene engines rather than just demos.
🎬 Creative stacks: Runway 4.5, Nano Banana, Gemini Flash, VFX
Many generative media posts today. Excludes T2V jailbreaks (safety). Focus on distribution, workflows, and practical pipelines for images/video and UI assets.
Nano Banana and Gemini 3 form a UI design stack in SuperDesign
Designer–devs are standardizing on a four‑step stack where Gemini 3 handles spec and implementation while Nano Banana Pro handles wild visual exploration, all wrapped in SuperDesign’s new plan mode. The flow is: write a detailed design spec and break it into small tasks, use Nano Banana to generate highly creative UI mockups for each task, extract tricky visual assets (icons, complex widgets, motion ideas) as images or short clips, then feed those into Gemini 3 or GPT‑5.2 Thinking to produce pixel‑accurate frontend code via SuperDesign’s agents. (workflow overview, nano banana ui step) SuperDesign has wired this directly into its product so you can import brand assets, run plan mode, and iterate—users report Gemini 3 as the best at multimodal replication, GPT‑5.2 Thinking as the strongest for complex implementation details, with Opus 4.5 lagging on default visual taste. (ui model comparison, superdesign case study)

Nano Banana Pro adds BANANA INPAINT for mask-based scene edits
Higgsfield upgraded Nano Banana Pro with BANANA INPAINT, a new mode where you draw a mask over part of an image to swap outfits, restyle hair, or even replace whole background scenes with high consistency from frame to frame. The team is pushing it hard with a 67% discount window and a promo that gives 269 credits for engagement, making it cheap for artists to try mask‑driven edits as part of their existing Higgsfield pipelines. inpaint launch thread The dedicated editor page walks through the inpaint workflow and pricing for people wiring this into production tools. (followup discount, product page)

Runway Gen-4.5 frontier video now available to all paid users
Runway has turned on its Gen‑4.5 text‑to‑video model for all paid plans, moving it from early access into the standard product so any subscriber can generate clips with the new "Gen‑4.5" option. Following initial launch that positioned Gen‑4.5 as a physics‑aware world engine, today’s in‑app upgrade banner explicitly says "now available for all paid plans," signaling that teams can start baking it into day‑to‑day creative workflows rather than treating it as an experiment. runway gen-4-5 ui
Chronological Mirror uses Nano Banana Pro plus Veo for aging sequences
Creators are treating Nano Banana Pro less like a dumb image model and more like an LLM that "understands" instructions, using it to build a 3×3 "Chronological Mirror" grid that ages a character from child to old age in analytically labeled steps. One workflow has NB Pro generate nine portraits tagged with ages and consistent style, then feeds the first and last frames into Veo 3.1 fast with a simple "timelapse aging effect" prompt to synthesize a smooth morphing clip between them. (aging grid demo, veo aging instructions) This follows earlier 3×3 product‑ad grids with NB Pro 3x3 workflow, and reinforces the idea that its strength is precise, compositional control when you treat prompts as detailed specs rather than loose vibes. (nb pro usage notes, prompt screenshot)

Freepik Spaces, Kling and Topaz chain a single image into a 1-minute music video
One creator laid out a practical end‑to‑end stack for turning a single Midjourney band image into a sharp one‑minute rock video using mostly off‑the‑shelf tools. They upscale and angle‑shift the still in Freepik Spaces, generate and animate several camera moves, then string the resulting clips together in CapCut, run them through Kling 2.6 where needed, and finally upscale the whole thing with Topaz Video AI—ending up with a surprisingly polished sequence for what is essentially B‑roll. (spaces workflow writeup, topaz upscaled result) The main complaint isn’t quality but UX: lots of small paper cuts around export behavior, file naming, and navigation that, once smoothed out, would make this kind of chained workflow much more approachable.


Gemini 3 Flash shows off 3k-line UI dumps and zero-shot animated SVGs
Early testers of Gemini 3.0 Flash are hammering it on creative workloads and reporting extremely high throughput: one demo has it emitting around 3,000 lines of HTML/CSS/JS to recreate an Xbox 360 controller UI in a single shot before Google truncates the response. long code gen demo Another user shows it generating a fully animated Spongebob SVG from a text prompt, no reference art, with smooth motion baked directly into the vector. spongebob svg animation In parallel, A/B models like the mysterious "af97" are being tested on 3D object tasks (an iPhone 16 Pro Max render) inside AI Studio, suggesting new checkpoints are being trialed before a full Flash rollout. iphone 3d comparison

Kling 2.6 is being used across full VFX pipelines, not just text-to-video
Kling AI is starting to look less like a pure text‑to‑video toy and more like a general‑purpose VFX workhorse in real productions. One breakdown calls out teams using Kling 2.6 for cleanup passes, relighting, prop and wardrobe swaps, sky replacements, advanced composites, and even tracking shots, effectively inserting it at multiple points in the compositing pipeline instead of only at the concept‑art stage. kling vfx usage This is the kind of multi‑stage adoption that tends to stick, because it slots alongside tools like After Effects and Nuke rather than trying to replace them outright.
Higgsfield Shots one-clicks nine different camera angles per scene
Higgsfield quietly shipped "Shots", a feature that takes a single described scene and then auto‑generates nine distinct camera shots so you can pick angles and coverage without hand‑crafting prompts. Instead of iterating on prompt tweaks for each framing, you hit a button and get a small storyboard of options, then feed the chosen shot into their video tools or downstream editing stack. shots feature description For people doing fast previz or social clips, this collapses a bunch of manual exploration into one step and pairs nicely with BANANA INPAINT and other Nano Banana workflows.
VEED Fabric 1.0 video model lands as an API on Replicate
VEED’s Fabric 1.0 model, which underpins its generative and editing features, is now exposed as a hosted API on Replicate, giving developers a straightforward way to script the same capabilities VEED uses in its own editor. veed fabric announcement For creative‑tool builders, this means you can call Fabric’s text‑to‑video or video‑to‑video transforms directly from Python or JS alongside other Replicate‑hosted models, instead of having to reverse‑engineer VEED’s UI or run heavy models yourself.
💼 Enterprise adoption and open‑source signals
Quieter but notable enterprise usage snapshots and a visible OSS pivot. Separate from infra and features. Useful for leaders tracking internal ROI and community momentum.
Enterprises report real AI productivity gains in HR, coding and operations
A Financial Times–style roundup shared today shows large firms quietly standardizing on internal AI tools for HR, engineering and operations, with some publishing hard numbers on impact. enterprise usage thread IBM’s AskHR generative chatbot now resolves 94% of staff HR queries, freeing millions of support hours across 270,000 employees, while Cognizant reports 37% productivity gains for less‑skilled coders vs 17% for experienced ones when pairing them with AI pair‑programmers. enterprise usage thread
The same piece cites Asana using AI for code generation and campaign planning, SentinelOne’s "Windsurf" assistant helping 800 engineers with coding, testing and bug‑fixing, and Schneider Electric’s systems that auto‑generate sales proposals and project plans from historical data. enterprise usage thread Orange runs AI to detect network anomalies and dynamically drop power draw during off‑peak hours, while Tem Energy arms staff with Claude Sonnet and GPT‑5.1 for coding and incident reporting. enterprise usage thread The catch is that these wins arrive alongside job cuts—IBM, for example, still laid off several thousand people in November 2025—so leaders get both a proof point that AI augmentation works at scale and a reminder that labor planning and change‑management are now inseparable from AI rollout.
VC data shows AI startups dominating application revenue while incumbents hold infra
New Menlo Ventures data says AI startups are already pulling in almost twice as much application revenue as incumbents—about $2 for every $1—even as big vendors still win most infrastructure spend. menlo venture chart Following up on provider share where Menlo showed Anthropic, OpenAI and Google slicing up enterprise LLM budgets, this new chart splits the market into departmental, vertical and horizontal AI apps plus AI infrastructure, then overlays startup vs incumbent share.
Across departmental tools (team‑specific copilots) and horizontal AI (broad productivity/workflow apps), startups hold 40–43%+ share, with incumbents at 57–60%. menlo venture chart In vertical AI, legacy vendors still control close to 88%, but the dollar totals there are smaller. On infrastructure—cloud, GPUs and core platforms—incumbents take 56% of spend versus 44% for startups. menlo venture chart For AI leaders this reinforces a pattern: if you’re an incumbent cloud or SaaS provider, you probably win infra by default but face serious startup competition in the application layer; if you’re a startup, chasing application revenue looks more open than trying to outspend hyperscalers on infra.
3,000 Reachy Mini humanoid robots ship worldwide as open robotics platform
Pollen Robotics, Hugging Face and Seeed Studio are shipping around 3,000 Reachy Mini robots worldwide this weekend, positioning it as one of the largest single drops of AI‑ready robots to date. reachy shipment thread The team is explicit that this first batch is "very bare‑bones" and aimed at AI builders: the hardware is minimal, the default software stack is sparse, and early units will likely have bugs, but the entire control software and simulator are open‑source on GitHub so the community can start building immediately. reachy shipment thread
Buyers who missed this wave are told to expect lead times of about 90 days for new orders, with the team working to shorten that window. reachy shipment thread In the meantime, developers can prototype against the Reachy Mini simulator in the repo and treat the physical robot as just another deployment target. For robotics and embodied‑AI teams, this looks like a practical path to get low‑cost, extensible humanoid hardware into labs and hacker spaces without waiting for a polished consumer robot market to appear.
Chorus AI chat app open sources and moves to pay‑as‑you‑go APIs
Melty Labs has open‑sourced Chorus, its Mac AI chat app, and is pivoting the business to focus on its orchestration product Conductor instead. chorus open source Starting now, Chorus runs entirely on user‑provided API keys in a pay‑as‑you‑go model rather than bundling access to specific models or selling its own subscription, which simplifies Melty’s infra costs but effectively turns Chorus into a community‑maintained client.
The team says they still use Chorus internally and "want it to thrive", hoping that by putting the code on GitHub it will gain a “second life” through external contributors and faster iteration. chorus open source The repo includes setup scripts and a dev environment aligned with their newer Conductor stack, so teams can also lift pieces into their own internal tools. chorus repo For leaders and engineers, this is another example of an early AI front‑end product becoming open infrastructure while the company monetizes higher up the stack on orchestration and workflow automation.
CopilotKit hits #1 on GitHub trending as devs adopt in‑app agent framework
CopilotKit, the open‑source React framework for building in‑app AI copilots and agents, climbed to #1 on GitHub’s global trending list with over 25,000 stars and 3,400+ forks, signaling a surge in developer adoption. (copilotkit trending, trending screenshot) The maintainers leaned into the moment, explicitly asking developers to star the repo to keep it on the front page and drive community momentum. star request
CopilotKit already exposes hooks like useAgent() and adapters for LangGraph and Mastra, letting teams drop agentic chat, copilots or workflows directly into existing React apps without re‑inventing orchestration. copilotkit repo Hitting the top of GitHub’s charts matters here because it tends to kick off a positive feedback loop: more stars attract more contributors and integrations, which in turn make it easier for product teams to standardize on a single OSS framework instead of rolling their own agent front‑ends.
New open‑source SonosCLI offers fast Go-based control for Sonos fleets
Indie developer Peter Steinberger released SonosCLI, an open‑source Go command‑line tool for discovering, grouping and controlling Sonos speakers over local networks, with a Homebrew formula for quick install. sonoscli release thread The project emphasizes low RAM usage, cross‑platform support and full coverage of common Sonos workflows: discovery via SSDP and topology, coordinator‑aware playback control, queue and favorites management, multi‑room grouping and even Spotify search and enqueue via Sonos’ SMAPI layer.
For teams wiring agents into offices or homes, a robust local CLI like this can be far more reliable than cloud APIs: you can script room takeovers, alerts or background music as part of your automation stack without touching Sonos’ mobile apps. Because it’s written in Go and MIT‑licensed, infra engineers can also embed or fork it into their own internal tooling, treating Sonos speakers as another addressable resource in the environment rather than a closed consumer device.



