Gemini 3 Deep Think hits 41% HLE – Ultra users get parallel reasoning

Stay in the loop

Free daily newsletter & Telegram daily report

Executive Summary

Gemini 3 Deep Think is finally live for Google AI Ultra subscribers in the Gemini app, and the numbers back up the fanfare. On Humanity’s Last Exam it scores 41% with tools off, beating Gemini 3 Pro at 37.5% and GPT‑5 Pro at 30.7%. GPQA Diamond lands at 93.8%, ahead of both Gemini 3 Pro and GPT‑5.1, while ARC‑AGI‑2 visual puzzles climb to 45.1% with tools on, versus 31.1% for Gemini 3 Pro and mid‑teens for GPT‑5‑class baselines and Claude Sonnet 4.5.

What’s new here is the surface and the stack. Deep Think is not a cute toggle on Pro; it’s a more compute‑hungry reasoning backend doing parallel hypothesis search, descended from the Gemini 2.5 Deep Think models that hit IMO and ICPC gold. In the Gemini app you enable “Deep Think” and select the “Thinking” model, then treat it like a manual turbo switch for brutal math, science, and multi‑file code instead of your default chat.

Early builders are split: some call it the smartest model they’ve touched, others see frozen or broken outputs “for days.” After a week of buzz around open DeepSeek V3.2 and GPT‑5.1‑Codex‑Max, Google’s counter is clear: frontier‑level reasoning, but for now locked behind Ultra and best used as a specialist, not your always‑on assistant.

Feature: Gemini 3 Deep Think rolls out with parallel reasoning

Gemini 3 Deep Think is live for Ultra subscribers with parallel hypothesis reasoning; early charts show 41% on Humanity’s Last Exam, 45.1% ARC‑AGI‑2 (tools on), 93.8% GPQA, and real‑world coding demos.

Cross‑account surge around Gemini 3 Deep Think going live for Google AI Ultra subscribers in the Gemini app. Mostly launch how‑tos plus new benchmark charts showing sizable gains on HLE/ARC‑AGI‑2 and a coding demo; little else competes for volume today.

Jump to Feature: Gemini 3 Deep Think rolls out with parallel reasoning topics

🧠 Feature: Gemini 3 Deep Think rolls out with parallel reasoning

Deep Think tops HLE and ARC-AGI-2 while nudging past GPT‑5 on science

New evals from Google DeepMind show Gemini 3 Deep Think jumping ahead of both prior Gemini versions and key competitors on several hard reasoning benchmarks. On Humanity’s Last Exam (HLE, tools off), it scores 41%, up from 37.5% for Gemini 3 Pro and above GPT‑5 Pro at 30.7%, while Claude Sonnet 4.5 lags at 13.7%. benchmark breakdown

On GPQA Diamond (graduate-level science questions, tools off), Deep Think reaches 93.8%, edging Gemini 3 Pro at 91.9% and GPT‑5 Pro at 88.4%, and slightly ahead of GPT‑5.1 at 88.1% and Claude Sonnet 4.5 at 83.4%. benchmark breakdown On ARC‑AGI‑2 visual reasoning puzzles, Deep Think with tools on hits 45.1%, versus 31.1% for Gemini 3 Pro (tools off) and mid‑teens for GPT‑5 Pro and GPT‑5.1; earlier Gemini 2.5 Pro sat at 4.9%, and Claude Sonnet 4.5 at 13.6%. benchmark breakdown A separate comparison adds Claude Opus 4.5 into the mix: Deep Think’s 45.1% on ARC‑AGI‑2 tops Opus 4.5 at 37.6%, and its 41% HLE score beats Opus 4.5’s 28.4%, while GPQA Diamond remains tightly clustered (Deep Think 93.8% vs Opus 4.5 at 87%). opus comparison chart Commentators are already extrapolating these gains forward, speculating that another year of scaling could push HLE toward 60% and ARC‑AGI‑2 past 50%, which would materially change what you can safely offload to agents that operate without tools. future projections For engineers and analysts, the takeaway is that Deep Think isn’t only a UI toggle—it’s running a different, more compute‑hungry reasoning stack that currently leads public charts on the hardest multi‑step tasks.

Gemini 3 Deep Think hits 41% HLE – Ultra users get parallel reasoning

Executive Summary

Top links today

Feature: Gemini 3 Deep Think rolls out with parallel reasoning

Table of Contents

🧠 Feature: Gemini 3 Deep Think rolls out with parallel reasoning

Deep Think tops HLE and ARC-AGI-2 while nudging past GPT‑5 on science

Gemini 3 Deep Think rolls out to Google AI Ultra users in Gemini app

Builders test Deep Think for heavy coding and math, with strong but mixed early feedback

🧪 Fresh models and endpoints

GPT-5.1-Codex-Max spreads from Responses API into major dev tools

DeepSeek V3.2 arrives in LLM Gateway with multi-provider routing

Mistral Large 3 lands on Ollama Cloud under a 675B MoE tag

Microsoft ships VibeVoice-Realtime-0.5B for low-latency streaming TTS

🧰 Agent IDEs and coding workflows

Conductor uses hidden git refs and GPT‑5 to build “time‑travel” checkpointing

Cursor overhauls its Codex harness for GPT‑5.1‑Codex‑Max

Linear adds first‑class Codex delegation for engineering backlogs

GPT‑5.1‑Codex‑Max rapidly lands in Windsurf, VS Code, Droid and Copilot

Warp terminal adds a GUI file tree that ties into its agent flows

CodexBar adds CLI and status views to track Codex and Claude usage

RepoPrompt 1.5.46 adds full GPT‑5.1‑Codex‑Max support for repo‑scale prompting

🧩 Agent plumbing: connectors, MCP, and routing

Firecrawl plugs into Google ADK for multi‑agent web scraping and search

AG‑UI lands in AWS Strands docs as the chat front‑end for agents

CocoIndex ships a Claude Code Skill for building data pipelines from inside the agent

LLM Gateway adds DeepSeek V3.2 with multi‑provider routing and tool flags

⚙️ Serving and speed: inference recipes

Together’s AutoJudge promises 1.5–2× faster inference by learning token importance

Baseten claims DeepSeek V3.2 at ~0.22s TTFT and 191 tps

vLLM publishes DeepSeek‑V3.2 serving recipe with “thinking” and tools

📊 Leaderboards and evals: coding, vision, long tasks

GPT‑5.1‑Codex‑Max posts strong coding gains on VibeCodeBench and SWE‑Bench

METR finds GPT‑5.1‑Codex‑Max can autonomously handle 2–3 hour software tasks half the time

Claude Opus 4.5 tops new AutoCodeBench‑V2 coding benchmark at 82.9%

GPT‑5.1‑high climbs to #3 on Arena vision leaderboard, GPT‑5.1 to #4

DeepSeek‑v3.2 lands mid‑pack on Arena text leaderboard but leads open models in Math and Legal

MiniCPM‑4.1‑8B outpaces Ministral‑3‑8B on most benchmarks while staying ~2× faster

🧮 Reasoning/RL: from CUDA kernels to adaptive thinking

Tencent’s R‑Few lets LLMs self‑evolve with 1–5% human labels

Argos agentic verifier makes multimodal RL rewards denser and more grounded

Cross‑family verifiers give the biggest accuracy gains on math/logic tasks

CUDA-L2 code release invites reuse on non‑GEMM ops

Omni‑AutoThink trains multimodal models to think only on hard inputs

Prompt‑free verify‑and‑refine agents improve automated paper‑to‑code reproduction

Deep Research survey maps query planning, tools and RL for research agents

Grokked models forget specific data faster and with less collateral damage

TradeTrap shows tiny state corruptions can push LLM trading agents into 61% losses

HealthContradict benchmark exposes how LLMs handle conflicting medical evidence

💼 Enterprise deployment and GTM

Brazil’s biggest bank rolls Devin out across its entire SDLC

Snowflake and Anthropic deepen $200M Claude partnership around Cortex AI

Anthropic’s AI Interviewer pilots at scale and surfaces how pros really use AI

🗂️ Search, parsing and RAG pipelines

Datalab adds layout‑aware spreadsheet parsing at $6 per 1k “pages”

Exa turns 5,000+ NeurIPS papers into a semantic search corpus

LlamaIndex dissects OlmOCR‑Bench for document understanding and RAG

Weaviate wins AWS Rising Star award and ships Java client v6

🏗️ Compute economics and build‑out signals

AMD accepts 15% revenue skim on MI308 exports to keep China AI market access

Token prices for top reasoning models are collapsing fastest at the high end

AMD pegs AI to a 10‑year supercycle and deepens its OpenAI deal

US data center construction is closing in on general office spend

Jensen Huang puts small nuclear reactors on a 6–7 year timeline for AI power

Meta’s 30% metaverse cuts free up budget for AI infrastructure

🛡️ Safety, security and governance pressures

Judge details 20M‑chat discovery order against OpenAI in copyright case

Doublespeak attack shows in‑context “representation hijacking” of safety filters

SUSVIBES benchmark finds vibe‑coding agents ship vulnerable code ~90% of the time

Concordia paper maps systemic risks when LLM agents talk to each other

Perplexity’s BrowseSafe verifier hits ~90 F1 on prompt‑injection detection

TradeTrap shows small perturbations can wreck LLM trading agents’ portfolios

HealthContradict benchmark tests LLMs on conflicting medical evidence

🎬 Creative stacks: video+audio, image edits, and tooling

Kling Video 2.6 and Avatars 2.0 spread to arenas, agents and hosts

Higgsfield and Flowith codify grid→still→animation video workflows

Nano Banana Pro moves from Twitter art to TV ads

NotebookLM on iOS now generates infographics, slide decks, and 10k‑char personas

Seedream 4.5 jumps to #3 Image Edit and #7 T2I on AA leaderboards

ElevenLabs agents power Brock Purdy fan experience and creative summit talk