
Gemini 3 Pro beats GPT‑5.2 on CAIS – 57.1 Vision Index
Executive Summary
After last week’s GPT‑5.2 hype cycle, today’s third‑party evals paint a more nuanced frontier picture. On CAIS’s new Text Capabilities Index, Gemini 3 Pro edges out GPT‑5.2 with a 47.6 average vs 45.9, and a big lead in expert‑level reasoning (38.3 vs 29.9) plus Terminal‑Bench coding (53.4 vs 47.7). GPT‑5.2 still owns ARC‑AGI‑2 abstract reasoning at 43.3 vs Gemini’s 31.1, so the “hard math and puzzles → OpenAI” routing rule holds.
CAIS’s Vision Index tilts even harder toward Google: Gemini 3 Pro clocks a 57.1 average vs GPT‑5.2’s 52.6, with standout scores on embodied ERQA (70.2 vs 60.7) and MindCube spatial navigation (77.3 vs 61.7). OCR‑Arena tells a similar story for documents: GPT‑5.2 Medium lands #4 with Elo 1648 and 34.3s latency per page, behind Gemini 3 Preview, Gemini 2.5 Pro, and Opus 4.5 Medium.
WeirdML adds cost texture: GPT‑5.2‑xhigh nudges ahead in accuracy (~0.722 vs Gemini 3 Pro’s ~0.699) but burns around $2.05 per run against Gemini’s $0.526, roughly 3.9× the cost for about 2.3 points of accuracy. Meanwhile, builders are openly dunking on SimpleBench after it ranks GPT‑5.2 below older models, treating it as noisy trivia rather than a routing oracle. Net: use GPT‑5.2 for peak abstract reasoning, but Gemini 3 Pro looks like the current default for multimodal coding, agents, and doc‑heavy workloads.
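Taken together, those deltas amount to a simple routing heuristic, sketched below in Python. The task categories, model ID strings, and the `ROUTING_TABLE`/`route_model` names are illustrative assumptions, not any provider’s API; only the benchmark-derived preferences and cost figures come from the evals above.

```python
# Hypothetical task-routing sketch based on today's third-party evals.
# Model IDs and categories are illustrative, not a real provider API.

ROUTING_TABLE = {
    # GPT-5.2 leads ARC-AGI-2 abstract reasoning (43.3 vs 31.1).
    "abstract_reasoning": "gpt-5.2",
    # Gemini 3 Pro leads expert reasoning (38.3 vs 29.9)
    # and Terminal-Bench coding (53.4 vs 47.7).
    "expert_reasoning": "gemini-3-pro",
    "agentic_coding": "gemini-3-pro",
    # Gemini 3 Pro leads CAIS Vision (57.1 vs 52.6), ERQA,
    # MindCube spatial navigation, and OCR-Arena documents.
    "multimodal": "gemini-3-pro",
    "document_ocr": "gemini-3-pro",
}

def route_model(task_category: str, cost_sensitive: bool = False) -> str:
    """Pick a default model for a task category, falling back to Gemini 3 Pro.

    On WeirdML, GPT-5.2-xhigh buys ~2.3 points of accuracy for ~3.9x the
    per-run cost ($2.05 vs $0.526), so cost-sensitive callers stay on Gemini.
    """
    choice = ROUTING_TABLE.get(task_category, "gemini-3-pro")
    if cost_sensitive and choice == "gpt-5.2":
        return "gemini-3-pro"
    return choice

print(route_model("abstract_reasoning"))                       # gpt-5.2
print(route_model("abstract_reasoning", cost_sensitive=True))  # gemini-3-pro
print(route_model("document_ocr"))                             # gemini-3-pro
```

The fallback default mirrors the summary’s conclusion: route only peak abstract reasoning to GPT‑5.2, everything else to Gemini 3 Pro.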
Top links today
- FACTS leaderboard benchmark for LLM factuality
- Confucius Code Agent open-source coding agent
- Universal Weight Subspace Hypothesis paper
- LLM scientific review indirect prompt injection
- VEIL jailbreak attacks on text-to-video models
- Empirical study of human LLM coding collaboration
- CNFinBench benchmark for LLM finance safety
- Verifier-free LLM reasoning via demonstrations
- VIGIL runtime for self-healing LLM agents
- X-Humanoid human video to humanoid robot
- DoGe decouple to generalize vision language
- Robustness of LLM based trajectory prediction
Feature Spotlight
Feature: Gemini gets deeper context and better UX
Gemini tightens the end‑to‑end workflow: attach NotebookLM notebooks as context, mark up images to guide edits, use a smoother Live voice mode (a mute control, fewer cut‑offs), and get speech‑to‑speech Translate, making Gemini more useful for real work.
Cross-account updates concentrated on Google’s Gemini app: NotebookLM notebooks as chat context, image markup to steer edits, Live voice fixes (a mute control, fewer interruptions), speech‑to‑speech in Translate, and signs of a fresh 3 Pro checkpoint.