OpenAI GPT‑5.2 beats 74% of experts on GDPval – 11× cheaper work
Executive Summary
OpenAI finally answered the Devstral and Gemini noise with a workhorse: GPT‑5.2 is live across ChatGPT and the API, and on GDPval it now beats or ties human experts on roughly 71–74% of real knowledge‑work tasks. Those tasks typically take experts 4–8 hours and represent $150–$200 of billable time; OpenAI claims 5.2 does them over 11× faster at under 1% of the expert cost. This is the first OpenAI release that feels explicitly aimed at replacing mid‑career desk work, not toy prompts.
Product‑wise, you get three tiers: Instant for everyday chat, Thinking for serious reasoning, and Pro as the slow, heavy hitter. Standard 5.2 runs at $1.75 / $14 per 1M input / output tokens with a 400k context window; Pro jumps to $21 / $168 and lives only behind the Responses API for multi‑minute, high‑effort traces. Benchmarks back the positioning: 55.6% on SWE‑Bench Pro, 92.4% on GPQA Diamond, 100% on AIME 2025, and ARC Prize‑verified 90.5% on ARC‑AGI‑1 with a 390× cost‑efficiency gain over last year’s o3 preview.
The System Card is the other big story: hallucinations drop about 30–40%, deceptive tool use falls from 7.7% to 1.6%, and long‑context retrieval stays near‑perfect out past 128k tokens. It’s clearly a better brain for agents and coders, but not magic—you still need routing, evals, and human checks where mistakes are expensive.
Top links today
- OpenAI GPT-5.2 introduction and docs
- OpenAI GPT-5.2 benchmark results
- GPT-5.2 safety and system card
- GDPval knowledge work benchmark description
- ARC Prize leaderboard with GPT-5.2 scores
- Gemini Interactions API developer overview
- Gemini Deep Research agent documentation
- Scaling laws for multi-agent AI systems
- AutoGLM mobile agent GitHub repo
- Ultra-FineWeb-en-v1.4 2.2T token dataset
- SentenceTransformers v5.2.0 release notes
- Runway Gen-4.5 and GWM-1 world model
- DeepCode open agentic coding paper
- AUP study on diffusion language models
- DeepSearchQA web research benchmark release
Feature Spotlight
Feature: OpenAI ships GPT‑5.2 for work and agents
OpenAI releases GPT‑5.2 across ChatGPT and API with expert‑level GDPval, major long‑context and coding gains, and clear pricing tiers—including a costly Pro. Sets the competitive bar for enterprise tasks and agent workflows.
🧄 Feature: OpenAI ships GPT‑5.2 for work and agents
Cross‑account launch dominates today: GPT‑5.2 (Instant, Thinking, Pro) lands in ChatGPT and API with big gains on real‑world knowledge work, coding, and long‑context; pricing and system card details included. Mostly model/eval posts; separate sections exclude this launch.
GPT-5.2 hits human-expert level on GDPval knowledge work benchmark
On GDPval—the most concrete benchmark so far for economically valuable knowledge work—GPT‑5.2 Thinking is the first OpenAI model that beats or ties industry professionals on 70.9% of tasks, with GPT‑5.2 Pro climbing to 74.1%, versus GPT‑5’s 38.8% win/tie rate just a few months ago. gdpval table
These tasks span 44 occupations and look like real jobs: presentations, spreadsheets, urgent care schedules, financial models, manufacturing diagrams, and more, each normally taking human experts 4–8 hours and often being worth $150–$200 of billable time. gdpval explainer OpenAI also reports that GPT‑5.2 produced GDPval outputs >11× faster and at <1% of expert cost, assuming raw model time and API pricing and ignoring human review overhead. gdpval win chart Commentators like Ethan Mollick and Daniel Miessler are calling this the economically relevant result of the launch, with Mollick noting that in head‑to‑head, expert‑judged comparisons “GPT‑5.2 wins 71% of the time” on work that people actually get paid for. (emollick reaction, labor market take) The caveat is that GDPval covers well‑specified tasks with clear instructions and examples. It doesn’t capture open‑ended discovery, persuasion, or organizational navigation. For AI leads, though, it’s strong evidence that—with good prompting and oversight—you can hand GPT‑5.2 large chunks of structured knowledge work and expect outputs at or above mid‑career human quality most of the time, at a fraction of the time and cost.
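For a sense of scale, here is a back‑of‑envelope version of that cost claim; the per‑task token counts below are made up for illustration, while the prices are the published $1.75 / $14 per 1M tokens for standard GPT‑5.2:

```python
# Hypothetical token counts for one GDPval-style task; prices are the
# published $1.75 / $14 per 1M input/output tokens for standard GPT-5.2.
input_tokens, output_tokens = 60_000, 25_000
model_cost = input_tokens / 1e6 * 1.75 + output_tokens / 1e6 * 14
expert_cost = 175  # midpoint of the $150-$200 billable range
print(f"model ~ ${model_cost:.2f} per task, {model_cost / expert_cost:.2%} of expert cost")
# -> model ~ $0.46 per task, ~0.3% of expert cost, consistent with the "<1%" claim
```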
OpenAI launches GPT-5.2 Instant, Thinking, and Pro for ChatGPT and API
OpenAI has formally launched GPT‑5.2 as its new flagship model series—Instant, Thinking, and Pro—rolling it out to ChatGPT Plus, Pro, Business, and Enterprise users today, with Free and Go users getting access next, and exposing it in both the API and Codex stack. Following weeks of stakes‑setting hype around “Garlic” and 5.2 as a make‑or‑break release timing and stakes, the company positions this as the model for professional knowledge work and long‑running agents, not a side‑grade over GPT‑5.1. (launch summary, rollout details)
GPT‑5.2 Instant is the new default chat model (warm, cheap, and tuned for everyday tasks), GPT‑5.2 Thinking is the reasoning tier for complex work, and GPT‑5.2 Pro is a separate, heavier tier aimed at hard research and coding questions with multi‑minute reasoning windows. (thinking overview, pro overview) All three share the same August 31, 2025 knowledge cutoff and are immediately available via the gpt-5.2, gpt-5.2-chat-latest, and gpt-5.2-pro IDs in the Platform, as detailed in OpenAI’s Introducing GPT‑5.2 post. launch blog For builders, the key change versus GPT‑5.1 is that this is not a small nudge: both OpenAI leadership and early users emphasize that GPT‑5.2 is the first model they’re comfortable calling “best for real‑world work,” especially once you wire it into an agent harness that can run for a long time and call tools.
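If you just want to poke at the new IDs, a minimal call through the official Python SDK looks roughly like this; the model IDs are the ones OpenAI lists, everything else is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Standard-tier call; the model IDs (gpt-5.2, gpt-5.2-chat-latest, gpt-5.2-pro)
# are from the launch post, the prompt is a placeholder.
resp = client.responses.create(
    model="gpt-5.2",
    input="Turn these bullet points into a one-paragraph status update: ...",
)
print(resp.output_text)
```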
ARC Prize verifies GPT-5.2 Pro as new ARC-AGI SOTA with 390× efficiency gain
ARC Prize independently verified GPT‑5.2 Pro (X‑High) at 90.5% on ARC‑AGI‑1 for $11.64 per task, surpassing last year’s unreleased o3‑High preview, which scored 88% at an estimated $4,560 per task—a roughly 390× cost‑efficiency improvement in one year. arc prize thread
On ARC‑AGI‑2, GPT‑5.2 Pro (High) scores 54.2% at $15.72 per task, ahead of previous public results like Gemini 3 Pro’s refinement runs, while the base GPT‑5.2 (Medium/High/X‑High) cluster around the low‑50s for costs in the low single dollars. (arc-agi2 leaderboard, arc-agi2 cost curve) ARC emphasizes that humans are still at 100% and the 2025 Grand Prize target is $0.20 per task, so even with these gains models are many orders of magnitude less energy‑ and cost‑efficient than people. gap vs humans But GPT‑5.2’s point on the score‑vs‑cost frontier is a strong signal that inference‑time compute and better search are paying off: instead of needing a $4k parallel swarm, you can get near‑SOTA behavior with a modest number of long‑running reasoning calls. arc jobs post For frontier‑safety‑minded engineers, the other notable ARC detail is that while GPT‑5.2 now exhibits what the prize organizers call “genuine fluid intelligence on simple tasks,” the large remaining gap to human efficiency and some failure patterns on ARC‑AGI‑2 show there is still meaningful room for architectural and tooling advances, not just scaling.
GPT-5.2 pricing, context window, and Pro tier economics
GPT‑5.2 standard (Instant/Thinking) is priced at $1.75 per 1M input tokens and $14 per 1M output tokens, with cached input at one‑tenth cost, up from GPT‑5.1’s $1.25 / $10, and it supports a 400,000‑token context window with up to 128,000 output tokens. pricing card
The model card notes an August 31, 2025 knowledge cutoff and explicit support for reasoning tokens, making it suitable for big documents and agent traces as long as you manage output lengths carefully. model card screenshot GPT‑5.2 Pro is a different beast: exposed only via the Responses API as gpt-5.2-pro, it runs at $21 per 1M input and $168 per 1M output tokens, with 400k context and 128k output but significantly higher typical reasoning effort and latency. (pro pricing table, pro model card) OpenAI explicitly recommends using background mode and the Batch API for Pro to avoid timeouts and to amortize cost over longer runs, and gives it three reasoning levels—medium, high, and xhigh—for developers who want to trade speed against depth. (responses api note, priority pricing) So for day‑to‑day workloads, teams will likely reserve GPT‑5.2 Pro for a small minority of queries (deep research, thorny bugs, proof‑style math), while routing most traffic through GPT‑5.2 Instant/Thinking, where the 40% price hike over 5.1 buys much stronger capabilities but still sits in a tractable cost band for production use.
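OpenAI’s background‑mode recommendation maps onto the existing Responses API pattern; here is a hedged sketch using the published gpt-5.2-pro ID and reasoning‑effort levels, with a placeholder prompt and an arbitrary polling interval:

```python
import time
from openai import OpenAI

client = OpenAI()

# Long gpt-5.2-pro run in background mode, so the HTTP request doesn't have to
# stay open for a multi-minute reasoning trace. "xhigh" is one of the three
# effort levels OpenAI lists for Pro (medium / high / xhigh).
job = client.responses.create(
    model="gpt-5.2-pro",
    input="Review this financial model for internally inconsistent assumptions: ...",
    reasoning={"effort": "xhigh"},
    background=True,
)

# Poll until the background run finishes.
while job.status in ("queued", "in_progress"):
    time.sleep(20)
    job = client.responses.retrieve(job.id)

print(job.status, job.output_text[:500] if job.output_text else "")
```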
GPT-5.2 Thinking tops most OpenAI, Anthropic, and Google benchmarks
OpenAI published a big cross‑lab benchmark table where GPT‑5.2 Thinking edges out GPT‑5.1, Claude Opus 4.5, and Gemini 3 Pro on most reasoning and knowledge tests, especially abstract reasoning and knowledge‑work evals. benchmarks chart
On SWE‑Bench Pro it reaches 55.6% (vs 50.8% for GPT‑5.1, 52.0% for Opus 4.5, 43.3% for Gemini 3 Pro), and on SWE‑Bench Verified internal runs show 80% accuracy versus 76.3% for GPT‑5.1, indicating real gains on multi‑file, real‑repo coding tasks. (gpt5x vs gpt5-1, swebench pro graph) The model also posts 92.4% on GPQA Diamond (science Q&A, no tools), 88.7% on CharXiv with Python (chart and figure reasoning), and 100% on AIME 2025 (competition math, no tools), with FrontierMath Tier 1–3 at 40.3% versus 31.0% for GPT‑5.1. (sam benchmark tweet, summary of table) On the abstract reasoning side, GPT‑5.2 Thinking hits 86.2% on ARC‑AGI‑1 and 52.9% on ARC‑AGI‑2, materially ahead of Opus 4.5 (80.0% and 37.6%) and Gemini 3 Pro (75.0% and 31.1%). benchmarks chart Two places where GPT‑5.2 does not clearly dominate are FrontierMath Tier 4 (Gemini 3 still shows 18.8% vs GPT‑5.2’s 14.6%) and some highly optimized tool‑calling and coding harnesses where Opus 4.5 and Gemini 3 retain edges, as several independent eval threads note. (kimmonismus breakdown, vals index summary) But as a general‑purpose, single‑endpoint model for reasoning, code, and knowledge work, GPT‑5.2 Thinking now looks like the strongest broadly available option, with Pro reserved for even harder, higher‑budget runs.
Builders report big gains from GPT-5.2 for coding and agents, with caveats
Early testers who had GPT‑5.2 access in November are almost uniformly saying it “feels like the biggest upgrade in a long time” for complex reasoning, coding, math, and agentic workflows, even before new file‑output features arrive. (sam usage tweet, sam upgrade feel) One developer shows GPT‑5.2 Pro building a full 3D graphics engine in a single Twigl shader file—with interactive controls and 4K export—in a single shot, 3d engine comment while others report that it can autonomously research and build detailed HLE score visualizations from scattered web sources. hle graph demo

Coding‑focused tools are already switching defaults: Codex CLI, Cursor, Warp, Windsurf, Droid, Kilo Code and others all report stronger multi‑step fixes, better validation, and fewer derailed long sessions with GPT‑5.2 than with GPT‑5.1, especially when using the new higher reasoning‑effort levels. (codex cli note, codex load tweet, warp integration, droid announcement) Users describe the subjective feel as “coding is so fun now… I’d rather lose a few fingers than lose 5.2,” reflecting much smoother agent loops and fewer inexplicable stalls. cursor sentiment There are real trade‑offs. GPT‑5.2 Pro is extremely expensive at $21 / $168 per million tokens, and some evals (LisanBench, certain Tau2 tool‑calling tasks, automated PR reviews) show it only matching or slightly edging GPT‑5.1‑Codex‑Max rather than blowing it away, especially when you factor in cost. (pro pricing table, lisanbench summary, swebench verified note) Long, X‑High runs can take many minutes and consume vast numbers of reasoning tokens, so teams are already building routing layers that send only the nastiest problems to 5.2 Pro, keep most work on 5.2 Thinking, and fall back to 5.1 or smaller models when latency or budget is tight.
The point is: GPT‑5.2 gives you a noticeably more competent and stable top‑end model for serious coding and knowledge work, but you still need to architect around cost and variance—think thoughtful routing, background jobs, and explicit evals—rather than just flipping a global default and calling it done.
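A routing layer along those lines can stay very simple; this sketch assumes you already score request difficulty and latency budget upstream, and the thresholds and fallback IDs are illustrative choices, not anything OpenAI prescribes:

```python
# Illustrative router: send only the hardest, slowest-budget work to Pro,
# keep the default path on standard GPT-5.2, and fall back for tight latency.
# Difficulty scoring, thresholds, and fallback IDs are your call, not OpenAI's.
def pick_model(difficulty: float, latency_budget_s: float) -> dict:
    if difficulty > 0.9 and latency_budget_s > 300:
        # Deep research, thorny bugs, proof-style math: background Pro run.
        return {"model": "gpt-5.2-pro", "reasoning": {"effort": "high"}, "background": True}
    if latency_budget_s < 10:
        # Interactive paths where Thinking-level latency is too slow.
        return {"model": "gpt-5.1", "reasoning": {"effort": "low"}}
    return {"model": "gpt-5.2", "reasoning": {"effort": "medium"}}


print(pick_model(difficulty=0.95, latency_budget_s=1800))
```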
GPT-5.2 sharply improves long-context retrieval on MRCRv2
On OpenAI’s MRCRv2 “needle in a haystack” eval, GPT‑5.2 Thinking stays near‑perfect on 4‑needle setups all the way out to 128k–256k tokens, and remains very strong on 8‑needle tests, while GPT‑5.1’s accuracy collapses as context grows. (4-needle chart, 8-needle chart) The official charts show GPT‑5.1 Thinking’s mean match ratio falling from ~80% at 8k tokens to ~40% by 256k, whereas GPT‑5.2 Thinking’s curve barely moves, hugging the 90–100% band across the same range.
Independent re‑runs by Context Arena report similar patterns using more detailed metrics like AUC over position and different reasoning budgets; their Medium and X‑High tests show 5.2’s 8‑needle performance comfortably ahead of Gemini 3 Pro on hard settings, at the cost of significantly more reasoning tokens and wall‑clock time on X‑High. (context arena summary, mrcr arena update) The takeaway is clear: for large documents, multi‑file codebases, and long‑running agents, GPT‑5.2 is dramatically less prone to “context rot” than GPT‑5.1, especially when you’re willing to pay for higher reasoning effort.
For builders, this shifts some design trade‑offs. You can be more aggressive about stuffing transcripts, multi‑doc corpora, or long debugging sessions into a single context, and rely less on fragile summarization hops—though combining the new /compact endpoint and retrieval is still the safer pattern for very long‑horizon agents. kimmonismus breakdown
System Card: GPT-5.2 cuts hallucinations and deceptive tool use
OpenAI’s GPT‑5.2 System Card reports a 30–40% drop in hallucinations versus GPT‑5.1 on real ChatGPT‑style traffic, with incorrect claims falling from ~1.5% to 0.8% and responses containing at least one major factual error falling from 8.8% to 5.8% when browsing is enabled. hallucination excerpt
They also document sharp reductions in deceptive behavior: on sampled production conversations, the rate of deceptive responses (e.g., hiding tool usage or fabricating capabilities) goes from 7.7% with GPT‑5.1 Thinking to 1.6% with GPT‑5.2 Thinking, and from 11.8% to 5.4% in an adversarial “red‑team” setting designed to provoke misrepresentation. system card summary The System Card highlights improved handling of mental‑health and emotional‑reliance prompts, with GPT‑5.2 scoring higher on internal metrics for appropriately refusing, encouraging human help, and avoiding over‑personalized support. mental health table At the same time, OpenAI notes that GPT‑5.2 Thinking performs on par with GPT‑5.1‑Codex‑Max on AI‑self‑improvement evals, meaning they do not see evidence yet that this model can autonomously self‑upgrade to mid‑career‑engineer‑level capabilities, and they keep it below their “High” threshold on that axis. ai self-improvement note For practitioners, the message is nuanced: GPT‑5.2 Thinking is less likely to hallucinate or secretly lie about tools, but the residual 5–6% major‑error rate on realistic traffic still demands layered safety (retrieval, explicit citations, human checks) for high‑stakes domains. You get safer defaults, not safety you can forget about.
🕸️ Google’s Interactions API and Deep Research agent
Google DeepMind rolls out a unified Interactions API with optional server‑side state, background runs, and remote MCP tools. Deep Research agent posts new SOTA on HLE/DeepSearchQA. Excludes GPT‑5.2 launch which is covered as the feature.
Gemini Deep Research hits SOTA on HLE, DeepSearchQA and BrowseComp
Gemini Deep Research, Google DeepMind’s new web‑navigating agent built on Gemini 3 Pro, debuts with 46.4% on Humanity’s Last Exam, 66.1% on the new DeepSearchQA benchmark, and 59.2% on BrowseComp, edging out both raw Gemini 3 Pro and prior “deep research” baselines. metrics chart All runs use web search, and on DeepSearchQA it beats Gemini 3 Pro by ~9.5 points (66.1% vs 56.6%), while roughly tripling the earlier o3‑style deep research baselines in some cases. ai summary
HLE stresses broad reasoning and knowledge across many domains, DeepSearchQA is a 900‑task Kaggle benchmark for multi‑step web research, and BrowseComp measures how reliably an agent can dig up hard‑to‑find facts; Deep Research now leads all three when you allow it to browse and plan.
Google is also open‑sourcing DeepSearchQA so others can evaluate their own agents on the same kind of plan → search → read → synthesize task, instead of relying on pure QA leaderboards that don’t reflect real browsing workflows. deepsearchqa release For teams building research agents, the key takeaway is that the lift isn’t just from a beefier model: Deep Research layers multi‑step RL and a purpose‑built browsing policy on top of Gemini 3 Pro, then exposes that whole system as a first‑class agent in the Interactions API. deep research intro Analysts like Kim Moeller have already called the 46.4% HLE score “insane” for a production agent and note that it briefly put Google in the lead before GPT‑5.2’s launch shifted some attention back to OpenAI. hle reaction The deeper implication is that agent benchmarks are starting to matter as much as raw model scores, since Deep Research shows you can get big wins on web‑heavy tasks via agent training and tooling without changing the base model architecture.
Google ships Interactions API with Gemini Deep Research and MCP tools
Google launched the Interactions API, a unified REST interface for both models and agents, and made Gemini Deep Research its first built‑in agent, giving developers a single entrypoint for long‑running, tool‑using workflows built on Gemini 3 Pro. interactions launch The API supports optional server‑side state, background execution for long tasks, and remote MCP tool servers, plus a simple Python client where you either pass model="gemini-3-pro-preview" with tools like google_search, or agent="deep-research-pro-preview-12-2025" with background=True for hands‑off runs. python snippet
The point is: instead of juggling separate chat, agent, and tools APIs, you now hang everything off client.interactions.create(), with Google handling state management and reconnects for multi‑step jobs. asset roundup Deep Research itself sits on this API and is trained with multi‑step RL to plan, browse, read your uploaded docs, and emit dense, citation‑rich reports using Gemini 3 Pro as the core model. deep research intro Logan Kilpatrick is pitching this as “a single endpoint for agentic workflows” with optional server‑side context and remote MCP support, calling it a “10x better interface”. interactions launch For engineers, this changes how you wire Gemini into backends: you can kick off long research jobs in the background, store and resume interaction state on Google’s side, and bring in existing MCP servers (e.g. databases, internal APIs) without inventing your own orchestration layer. interface tweet It also lines up neatly with MCP’s move under the Linux‑backed Agentic AI Foundation, which is trying to make these tool protocols a cross‑vendor standard rather than one‑off agent frameworks. mcp foundation clip
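Based on the snippets Google has shared, the model path looks roughly like this in Python; treat the client attribute and the tool declaration shape as assumptions until you have checked the official quickstart:

```python
# Sketch of the "model" path against the Interactions API, using the names
# from Google's launch posts; the client attribute and tool schema are
# assumptions until verified against the official quickstart.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

interaction = client.interactions.create(
    model="gemini-3-pro-preview",
    input="Summarize this week's MCP governance news with cited sources.",
    tools=["google_search"],  # exact tool declaration shape is an assumption
)
print(interaction)
```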
Builders begin adopting Interactions API for long-horizon Gemini agents
Early developer reactions suggest the Interactions API is landing as more than “just another SDK wrapper”: it gives teams a realistic way to run long‑horizon Gemini agents with server‑side state, background jobs, and MCP‑style tools without standing up their own orchestrator. interactions launch Logan Kilpatrick highlights optional server‑side context, background execution and remote MCP tool support as the core primitives, and calls the overall interface “10x better” than prior ad‑hoc setups. interface tweet
Google DeepMind’s Phil Schmid is pushing a full asset bundle—docs, quickstarts, and Deep Research examples—for Interactions, framing it as the foundation “for models and agents” rather than just chat completions. asset roundup Commentary from AI builders like kimmonismus and ai_for_success is notably upbeat; they describe Gemini Deep Research Pro as “really interesting” and say Google is “cooking so hard” on agent tooling right now. builder reaction For an AI engineer, the immediate move is to treat Interactions as a stateful task runner: you can create an interaction with Deep Research in the background, let it spend minutes or hours reading the web and your files, then poll or callback on completion instead of babysitting a single HTTP request. dev agent link Because MCP servers are supported as remote tools, it also meshes cleanly with the broader MCP ecosystem that Anthropic just placed under the Linux Foundation’s Agentic AI Foundation, hinting that tools you define once there can now be shared across Claude, Gemini, and eventually other stacks. mcp foundation clip
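The agent path is the interesting one for long jobs; this sketch shows the pattern described above, but the retrieval call and status values are guesses, so verify them against the Interactions docs before relying on them:

```python
import time
from google import genai

client = genai.Client()

# Agent path: hand the whole job to Deep Research and let it run server-side.
job = client.interactions.create(
    agent="deep-research-pro-preview-12-2025",
    input="Map the current landscape of MCP-compatible tool servers.",
    background=True,
)

# Hypothetical polling loop: the retrieval method name and status values are
# guesses; a callback/webhook would work just as well here.
while True:
    current = client.interactions.get(job.id)
    if current.status in ("completed", "failed"):
        break
    time.sleep(60)
print(current)
```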
Google tests Disco, a Gemini agent that turns your tabs into task apps
Google quietly opened an opt‑in for Disco, a Google Labs experiment where Gemini looks at your open Chrome tabs and turns them into a custom mini‑web‑app—say, a trip planner UI built from your hotel and flight tabs—that you then steer by chatting. disco teaser The prototype, also dubbed GenTabs in some tests, lets you hand control over layout and summarization to Gemini while you keep editing via text instructions.

In the demo, a cluster of travel and booking tabs becomes a single planning surface with structured cards, filters, and editable details; the user then asks Gemini inside Disco to tweak dates and options instead of manually hopping between sites. disco teaser It’s not exposed through the Interactions API yet, but the design lines up perfectly with Google’s agentic push: Gemini as a browser‑level orchestrator that composes existing pages into a task‑specific interface, rather than replacing the web outright.
If you’re building agentic products, Disco is a strong signal that Google sees the browser itself as the canvas for agents—very similar in spirit to Stitch and Gemini Deep Research—but with a heavier emphasis on UI synthesis from live tabs. Worth watching to see if this eventually gets an API surface or remains a consumer‑facing experiment, because the underlying capability (turn arbitrary pages into structured task UIs) is exactly what many internal tools and dashboards want.
📈 Frontier eval race: third‑party verifications and ladders
Today is heavy on evals and leaderboards. This section tracks independent checks and standings. Excludes the GPT‑5.2 product launch; focuses on downstream measurements and comparisons.
Context Arena’s MRCR shows GPT‑5.2’s long‑context wins come from heavy reasoning effort
Context Arena added GPT‑5.2 to its MRCRv2 long‑context leaderboards and found that the X‑High reasoning setting delivers near‑perfect multi‑needle retrieval up to 128k tokens (4‑needle AUC ~95.4%, 8‑needle AUC ~86.1%), beating Gemini 3 Pro Thinking, but at the cost of 4–20 minute latencies and 5× higher reasoning‑token use than earlier o‑series models. (mrcr charts, arena breakdown)
Medium‑effort GPT‑5.2, by contrast, uses fewer tokens than Gemini 3 Pro or Grok 4 and still lands close to Gemini on 4‑needle tests (AUC 92.4% vs ~85.8%), though Gemini 3 Pro Thinking remains ahead on the harder 8‑needle benchmark at a much lower compute budget. (arena breakdown, gemini comparison) Context Arena also introduced a "token efficiency" metric comparing actual prompt usage vs tiktoken estimates, highlighting that retrieval quality gains are now tightly coupled to how much inference‑time compute you can afford to burn. arena breakdown
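If you want to reproduce the token‑efficiency idea on your own prompts, a rough check is to compare a local tiktoken estimate against what the API reports billing; note that o200k_base is an assumption here, since OpenAI has not published GPT‑5.2’s tokenizer, and the prompt file is a stand‑in for your own long‑context input:

```python
# Compare a local tokenizer estimate against what the API actually billed.
# o200k_base is an assumption (OpenAI hasn't published GPT-5.2's tokenizer).
import tiktoken
from openai import OpenAI

enc = tiktoken.get_encoding("o200k_base")
prompt = open("long_context_prompt.txt").read()
estimated = len(enc.encode(prompt))

client = OpenAI()
resp = client.responses.create(model="gpt-5.2", input=prompt)
billed = resp.usage.input_tokens
print(f"estimated={estimated} billed={billed} ratio={billed / estimated:.2f}")
```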
Vals Index crowns GPT‑5.2 #1 but notes ~4× higher query cost
Evaluation outfit ValsAI now ranks GPT‑5.2 Thinking as the top model on its Vals Index (64.49% avg accuracy, #1 of 26) and Multimodal Index, with standout scores like 75.40% on SWE‑bench, 63.75% on Terminal‑Bench, and 35.56% on their Vibe Code Bench, but also reports that average query cost is about 4× higher than GPT‑5.1 across their suite. (vals dashboard, launch summary)
On domain‑specific tasks, GPT‑5.2 leads or is near the top on tax (75.76% on TaxEval v2, #2/83), finance (65.89% on CorpFin, #4/76), and legal and medical benchmarks, indicating broad strength across knowledge‑work scenarios rather than a narrow coding bump. vals dashboard Builders should note the trade‑off: Vals sees roughly a 10‑point jump on its Vibe Code Bench versus 5.1, but says that end‑to‑end query traces consume enough extra thinking and output tokens that per‑request cost multiplies, which matters for high‑volume production workloads. (vals commentary, cost analysis)
GPT‑5.2‑high debuts #2 on Code Arena WebDev leaderboard
Code Arena’s WebDev ladder now shows GPT‑5.2‑high entering at #2 with an Elo of 1486, just behind Claude Opus 4.5 and slightly ahead of Gemini 3 Pro, while the standard GPT‑5.2 model appears at #6 with a score of 1399. arena summary

Because Code Arena evaluations are full end‑to‑end web app builds from a single prompt (users vote on working deployments), this is one of the more practical coding leaderboards. The scores suggest GPT‑5.2‑high is immediately competitive with Opus 4.5 and Gemini 3 Pro for front‑end and full‑stack tasks, but not clearly dominant, and Arena notes the rankings are still preliminary as more matches come in. (arena demo, webdev ladder)
LisanBench and community evals show GPT‑5.2 Thinking better but not best at reasoning
The independent LisanBench evaluation reports that GPT‑5.2 Thinking (medium effort) improves over GPT‑5 and o3 on its composite reasoning suite and sets two new records for certain sub‑metrics, but still trails Claude Opus 4.5, Gemini 3 Pro, DeepSeek‑V3.2 Speciale, and Grok 4 in overall score and reasoning efficiency. lisanbench summary They note that 5.2’s average validity ratio is higher—meaning it’s less likely to end with an invalid or obviously broken final answer—but that its reasoning often runs longer and costs more without converting into proportionally higher correctness. lisanbench summary Complementary internal tests paint a similar picture: on CTF‑style security tasks, GPT‑5.2 brings only small gains over GPT‑5.1‑Codex‑Max, and OpenAI’s own MLE‑Bench, PaperBench, OpenAI‑Proof Q&A, and real pull‑request suites show essentially flat scores versus 5.1‑Codex‑Max. (ctf comparison, pull request evals) For teams chasing top‑end math or game‑like reasoning, this suggests GPT‑5.2 is an upgrade but not yet the single best choice in every regime.
Misc community benches (VPCT, LiveBench, SWE‑Bench Verified) show mixed GPT‑5.2 picture
Beyond headline scores, several niche benchmarks give a more nuanced view: GPT‑5.2 (X‑High) scores 84% on VPCT, nearly matching Gemini 3 Pro on this visual puzzle challenge, but falls behind Opus 4.5 and Gemini 3 Pro on LiveBench’s live web QA and trails them slightly on SWE‑Bench Verified when all models share the same mini‑swe‑agent harness. (vpct result, livebench ranking)
The official SWE‑Bench Verified leaderboard under the equal‑agent setup shows Opus 4.5 (medium) at 74.4%, Gemini 3 Pro at 74.2%, GPT‑5.2 (high) at 71.8%, and GPT‑5.2 (medium) at 69.0%, with GPT‑5.1 (medium) back at 66.0% and GPT‑5 mini at 59.8%. (swebench leaderboard, swebench site) That lines up with other community takes that GPT‑5.2 makes a solid step up over 5.1—especially on the harder SWE‑Bench Pro variant—but doesn’t cleanly displace Opus or Gemini across every independent ladder yet. (swebench commentary, arc efficiency plot)
Opus 4.5 holds Terminal‑Bench lead as Warp pushes GPT‑5.2 to 61.1%
Terminal‑Bench 2.0’s latest results keep Factory’s Droid agent using Claude Opus 4.5 on top at 63.1% ±2.7, while Warp reports that its new GPT‑5.2‑based setup now reaches 61.1%, up from ~59% in earlier runs. (terminal bench chart, warp update)
The table still shows GPT‑5.1‑Codex‑Max in second at 60.4%, followed by multi‑model Warp and Gemini 3 Pro agents in the high‑50s, so 5.2’s agent performance is strong but not yet surpassing the best Opus configuration. terminal bench chart For infra and tooling folks, the interesting bit is that both Droid and Warp get these scores with fairly standard agent harnesses, which makes Terminal‑Bench a useful sanity check on real‑world shell automation rather than a pure model exam. warp update
Extended NYT Connections benchmark shows clear GPT‑5.2 gains, Gemini still ahead
On the Extended NYT Connections‑style word‑grouping benchmark, GPT‑5.2 with high reasoning effort jumps from 69.9% to 77.9% accuracy compared to GPT‑5.1, with medium‑effort rising from 62.7% to 72.1%, and even the no‑reasoning track improving from 22.1% to 27.5%. (connections results, benchmark repo) MiniMax‑M2 clocks in around 27.6% on the same test, showing how much headroom large frontier models have on this puzzle‑like task. connections results A separate evaluator notes that despite these gains, Gemini 3 Pro still edges GPT‑5.2 on their Word Connections puzzles, mirroring a broader pattern where Gemini often leads on language‑heavy, pattern‑matching benchmarks even as GPT‑5.2 narrows the gap. gemini comparison
🛠️ Coding agents and IDEs: design‑in‑code and 5.2 wiring
Developer tooling shipped big UX and wired in new models. Cursor’s visual editor blurs design/engineering; multiple CLIs/IDEs add GPT‑5.2. Excludes the GPT‑5.2 launch details covered in the feature; here we track integrations and workflows.
CopilotKit v1.50 introduces `useAgent()` hook and LangGraph/Mastra adapters
CopilotKit v1.50 adds a new useAgent() React hook that connects any backend agent (LangGraph, Mastra, or your own) directly to the frontend, exposing the AG‑UI event stream so you can drive chat, generative UI, human‑in‑the‑loop flows, streaming reasoning steps, and state sync from one abstraction. useagent launch

On top of that, Mastra shipped server adapters for Express and Hono that auto‑expose Mastra agents, workflows, tools and MCP servers as HTTP endpoints, mastra server adapters so a useAgent() call in the browser can talk to a full agent backend without glue code. For people building agent‑powered coding dashboards or in‑browser IDE add‑ons, this combination dramatically lowers the friction of wiring frontends to multi‑step agent graphs.
Factory’s Droid adopts GPT‑5.2 for architecture, data and sysadmin tasks
FactoryAI now offers GPT‑5.2 inside Droid, and reports that compared to GPT‑5.1 the agent is more persistent on complex tasks, recovers from errors better, and follows subtle requirements with higher precision. droid gpt52 rollout They’re seeing particularly strong performance on architecture design, data analysis, and system‑administration workflows.
Droid was already a top performer on Terminal‑Bench 2.0 when paired with Opus 4.5, terminal bench table so bringing 5.2 into the mix gives teams another high‑end option when choosing the brains for their long‑running coding agents, especially in infra‑heavy projects.
Kilo Code tunes its agents around GPT‑5.2 for UI and bug‑fix work
Kilo Code has been retuned around GPT‑5.2, and the team says it’s especially good at generating and debugging UI code: they’re using it to fix layout issues and styling bugs with less back‑and‑forth than GPT‑5.1. kilo gpt52 support In a public demo, Kilo showed 5.2 driving an end‑to‑end build of a World Cup 2026 app—generating backend, front‑end, and deployment steps in one continuous agent run. worldcup app video For builders, the message is that if you’re already using Kilo as a coding agent, flipping its model to GPT‑5.2 should improve architectural design, analytics queries, and UI bug‑hunting without changing your workflow.
RepoPrompt 5.2 adds GPT‑5.2 and background jobs for Pro‑length tasks
RepoPrompt 5.2 rolls out full GPT‑5.2 support over both the API and Codex CLI, and adds a background‑job runner so you can safely use GPT‑5.2 Pro for long tasks without hitting request timeouts. repoprompt changelog That’s especially useful for multi‑hour refactors, huge test‑generation passes, or analysis over very large repos. A previous update (RepoPrompt tools) already turned RepoPrompt into a generic coding harness with MCP integrations; adding GPT‑5.2 plus background execution now makes it one of the easier ways to run x‑high reasoning jobs on codebases while still keeping everything scriptable from the CLI.
Zed editor turns on GPT‑5.2 for Pro and BYOK users
Zed added official support for GPT‑5.2 as an assistant backend: Pro subscribers get it automatically after a restart, and bring‑your‑own‑key users just need to upgrade to v0.216.1 (Stable). zed release note That means Zed’s inline assistants and code actions can now lean on 5.2’s stronger long‑context reasoning instead of older GPT‑5.x models. (devtools mention) Zed’s recent work (Zed UX) previously focused on better diffs and history for agent‑assisted edits; wiring in GPT‑5.2 on top of that makes Zed a serious option if you want a fast, native editor whose AI features keep up with the latest OpenAI models without juggling custom plugins.
Conductor wires GPT‑5.2 into its agentic orchestration for coding flows
Conductor added GPT‑5.2 as a selectable backend inside its agentic code‑orchestration UI, letting you route high‑stakes tasks (like multi‑service changes or complex migrations) to 5.2 while keeping cheaper models for simpler steps. conductor 5.2 demo The launch video shows Conductor driving a 5.2‑backed agent through a multi‑file change with live diffs and structured output control.

If you’re already using Conductor to choreograph coding agents across multiple tools, flipping some of those nodes to GPT‑5.2 gives you an easy way to experiment with x‑high reasoning effort on the riskiest parts of a pipeline without rewriting your graph.
LlamaIndex ships `ask` CLI and LlamaSheets for table‑centric agents
LlamaIndex introduced a simple ask CLI that lets you point at any folder and run semantic search and Q&A over local files using an efficient index, intended as a lightweight context layer for agents and tools like Claude Code rather than stuffing raw text into prompts. semtools intro They also released LlamaSheets, a specialized agent/model for extracting structured tables from gnarly Excel workbooks, including multi‑sheet and multi‑region layouts. llamasheets launch

Together, these tools make it easier to build coding agents that can query real project artifacts—requirements docs, PDFs, giant spreadsheets—without blowing the context window or re‑implementing retrieval in every project.
MagicPathAI ships GPT‑5.2 Thinking for UI layout and data‑viz design
MagicPathAI wired GPT‑5.2 Thinking into its design agent, and early tests show it doing better at clean, minimal layouts and dashboards than the previous model, especially when given tight design constraints. magicpath update Designers can now ask for multi‑widget dashboards or data‑heavy views and get more coherent hierarchy and spacing out of the box.

This sits in the same ecosystem as Cursor’s visual editor but targets agent‑driven design exploration: you describe the product surface in prose, let MagicPath generate Figma‑like UI, and then export to your coding agent of choice. GPT‑5.2’s stronger long‑context reasoning helps maintain consistency across multiple screens and components in a single run.
Rork switches to GPT‑5.2 for long‑context UI and frontend work
Rork’s coding assistant has turned on GPT‑5.2 support and says the new model does especially well at understanding huge projects, complex UIs, and frontend code thanks to its upgraded long‑context handling and vision. rork update The team calls out improvements in visual UI understanding and cleaner frontend code generation compared to GPT‑5.1.
For teams already using Rork to build or maintain large React/Vue apps, switching your default model to 5.2 should improve multi‑file refactors and design‑driven changes without retooling your existing workflows.
Julius AI exposes GPT‑5.2 for spreadsheet‑heavy data analysis
Julius AI, the notebook‑style data analysis tool, has added GPT‑5.2 as a model option, pitching it as the best choice when you need to analyze complex datasets, long contexts and multi‑step spreadsheet logic. julius gpt52 note That lines up with OpenAI’s own focus on 5.2’s spreadsheet skills, and gives analysts a way to try it without touching raw APIs.
For engineers and data people, this means you can prototype 5.2‑powered financial or ops analyses in Julius—over CSVs and dashboards—before deciding whether to embed the same workflows into your own applications.
💼 Enterprise & deals: Disney x OpenAI and platform adoption
Material enterprise signals today: a landmark Disney–OpenAI content/license + $1B equity tie‑up, plus broad GPT‑5.2 adoption across SaaS/dev stacks. Excludes GPT‑5.2 launch; focuses on contracts, go‑to‑market and ops impact.
Disney puts $1B into OpenAI and signs three‑year Sora content deal
Disney and OpenAI signed a three‑year agreement that does two big things at once: Disney will invest $1B in OpenAI equity plus warrants, and Sora will get a license to generate short fan‑prompted videos using 200+ Disney, Marvel, Pixar and Star Wars characters with a curated subset destined for Disney+. deal summary Disney will also adopt OpenAI APIs to build new products and internal tools, and roll out ChatGPT to employees across the company. deal summary
For AI leaders this is a signal that Hollywood IP giants are willing not only to license characters to frontier video models, but to take direct equity stakes and become large enterprise customers in the same move. openai reaction It also effectively crowns Sora as Disney’s preferred generative video stack at a moment when Disney is simultaneously challenging Google over alleged AI copyright misuse, sharpening the competitive divide between model providers. disney google letter Expect this to accelerate pressure on other studios and platforms to pick an AI video partner, and to normalize hybrid deals where content rights, distribution and AI infra are negotiated together rather than separately. press release
Box AI’s internal evals push GPT‑5.2 into production
Box has been testing GPT‑5.2 behind Box AI and reports a 7‑point accuracy gain over GPT‑5.1 on its expanded advanced reasoning eval, which mimics real enterprise workflows in sectors like finance, media, and healthcare. box eval results On a full dataset of complex doc tasks, GPT‑5.2 hits 66% vs 59% for 5.1, with similar or better gains on vertical subsets.
They also show substantial latency reductions: time‑to‑first‑token on long document extraction drops from 16.7s (GPT‑5.1) to 11.6s with 5.2, and analytical queries fall from 9.1s to 7.1s, while multi‑turn doc chats keep 5.1’s 5.4s TTFT. box eval results Box says it will roll GPT‑5.2 into Box AI and expose it through Box AI Studio, so if your org already uses Box this effectively upgrades your default model for doc Q&A, summarization, and spreadsheet‑like work without extra integration work. box eval results For other platform teams, Box’s results are a concrete reference point that 5.2 can pay off in both quality and responsiveness on ugly, long enterprise documents—not just benchmarks.
Disney’s cease‑and‑desist to Google underscores shifting AI content alliances
Alongside its $1B equity deal with OpenAI, Disney has sent Google a cease‑and‑desist letter accusing it of using AI to infringe Disney copyrights "on a massive scale" and demanding it stop. disney google article The complaint targets how Google allegedly trained and deployed models on Disney IP without appropriate licenses, in stark contrast to Disney’s formal Sora licensing agreement with OpenAI. deal summary
For AI strategists this is a clear signal that large rights‑holders are going to reward compliant partners with capital and content, while putting legal pressure on those seen as overreaching with training data. It raises the operational bar for model labs and clouds: you’ll need auditable data provenance and explicit deals for marquee IP if you want to avoid the kind of public friction Google is now facing, especially as more studios cut OpenAI‑style checks to secure both upside and guardrails.
OpenAI teams with Rappi to push ChatGPT Go across Latin America
OpenAI is expanding ChatGPT Go across Latin America via a partnership with delivery and fintech super‑app Rappi, letting users in nine countries try Go at no cost inside Rappi. go expansion The rollout covers Argentina, Brazil, Chile, Colombia, Costa Rica, Ecuador, Mexico, Peru and Uruguay, and is framed as a way to get more people hands‑on with Go before upselling to paid tiers. go expansion For product and GTM leaders this is a template: embed a low‑tier AI assistant into a high‑frequency consumer app to seed usage, then graduate power users to higher‑margin offers like Pro, Business, or API credits. go expansion It also hints at future integrations where Go and GPT‑5.2‑class models could sit behind Rappi’s own workflows (support, logistics, vendor tools), making Rappi a de facto regional AI channel rather than just another customer.
Windsurf and Devin move core workloads to GPT‑5.2
Cognition says GPT‑5.2 is now the default model across Windsurf and “several core Devin workloads”, after an early‑access period where they tuned agent behaviors on top of it. windsurf announcement They highlight it as a better fit for long‑running, complex tasks than GPT‑5.1, and position it as the new baseline for the Devin autonomous developer experience.
For engineers building on top of Devin or treating Windsurf as their primary IDE agent, this means your day‑to‑day tools will inherit GPT‑5.2’s strengths in long‑context reasoning and agents without any API work on your side. droid comparison It’s also a bellwether for other agent vendors: once a player whose whole product is "have an AI engineer do the work" standardizes on a new model, you should assume your users will expect similar depth and persistence elsewhere.
GPT‑5.2 Pro debuts near the top of OpenRouter’s price table
OpenRouter has added gpt‑5.2, gpt‑5.2‑chat, and gpt‑5.2‑pro to its catalog and notes that GPT‑5.2 Pro is now the third most expensive LLM on the platform by per‑token price. pricing overview Their public models page makes it easy to sort by “pricing high to low”, where 5.2 Pro clusters with other heavyweights like o3‑pro and high‑effort reasoning models. pricing page This is useful framing for teams evaluating whether to route only a sliver of traffic to 5.2 Pro (e.g. final reviews, critical financial analyses) while keeping bulk workloads on cheaper models, something OpenRouter’s multi‑provider abstraction makes straightforward. router integration If you already sit behind OpenRouter, you can A/B GPT‑5.2 vs your current defaults without touching vendor‑specific SDKs, but you’ll want to be disciplined about where Pro’s premium actually pays off.
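A quick way to run that A/B is OpenRouter’s OpenAI‑compatible endpoint; this sketch randomizes between two models per request, and the model slugs are illustrative, so check OpenRouter’s catalog for the exact IDs:

```python
# A/B GPT-5.2 against your current default through OpenRouter's
# OpenAI-compatible endpoint; model slugs below are illustrative.
import os
import random
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def answer(prompt: str) -> tuple[str, str]:
    model = random.choice(["openai/gpt-5.2", "anthropic/claude-opus-4.5"])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return model, resp.choices[0].message.content

print(answer("Summarize the key risks in this vendor contract: ..."))
```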
Notion’s ‘olive‑oil‑cake’ hook hints at GPT‑5.2 under the hood
Screenshots from Notion AI show a new internal model alias called “olive‑oil‑cake” wired into the assistant entrypoint, with prompts displayed as “Ask olive‑oil‑cake…” and a Notion‑scoped data access row for Slack, Gmail and GitHub. notion ui leak Given OpenAI’s own garlic/olive‑oil jokes around GPT‑5.2 and Sam Altman’s teasing of “Christmas presents”, this almost certainly represents Notion’s early GPT‑5.2 integration being dogfooded ahead of a public switch. altman tease
For AI platform leads this is an example of how major SaaS apps are abstracting model choice behind cute codenames and routing layers, keeping the UX stable while they swap out engines. If you build similar tools, it’s a reminder to design your data access, logging and billing around a model‑agnostic contract, so you can experiment with GPT‑5.2‑class upgrades in production without renaming features or retraining users.
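One lightweight version of that model‑agnostic contract is a thin interface plus a codename‑to‑model map, so swapping engines is a config change rather than a product change; this is an illustrative sketch, not how Notion actually wires it:

```python
# Illustrative model-agnostic contract: the product talks to a codename, a
# routing table maps it to a concrete engine, and swapping models is a config
# change. This is a sketch, not Notion's implementation.
from typing import Protocol

CODENAME_ROUTES = {"olive-oil-cake": "gpt-5.2"}  # alias -> concrete model ID

class Assistant(Protocol):
    def complete(self, prompt: str, user_id: str) -> str: ...

class OpenAIAssistant:
    def __init__(self, client, codename: str):
        self.client = client
        self.model = CODENAME_ROUTES[codename]

    def complete(self, prompt: str, user_id: str) -> str:
        resp = self.client.responses.create(model=self.model, input=prompt)
        # Attribute cost and latency to the codename + user_id in your own logs,
        # so billing and analytics survive an engine swap.
        return resp.output_text
```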
Perplexity turns GPT‑5.2 into a first‑class Pro/Max model option
Perplexity has rolled GPT‑5.2 into its Pro and Max tiers as a selectable backbone, with toggles for reasoning and non‑reasoning modes. perplexity rollout In the model picker, GPT‑5.2 appears alongside Sonar, Claude Opus 4.5, Gemini 3 Pro and Grok 4.1, and can be combined with Perplexity’s own "With reasoning" switch for longer, slower chains when you need them. perplexity rollout
This matters because Perplexity is one of the most widely used research assistants; routing queries through GPT‑5.2 there gives knowledge workers a quick way to trial the new model on web‑augmented tasks without touching raw APIs. followup mention For teams standardizing on Perplexity, it also means you can start to compare GPT‑5.2 vs Opus 4.5 vs Gemini on your own prompts while keeping the UX and safety layer constant, instead of wiring each provider yourself.
Research and coding SaaS tools race to wire in GPT‑5.2
A cluster of AI‑native work apps moved quickly to adopt GPT‑5.2 as a core model: Genspark added it to its all‑in‑one workspace, using Nano Banana Pro for visuals and GPT‑5.2 for slide and doc generation genspark launch; JuliusAI plugged it into spreadsheet‑style data analysis julius integration; Conductor updated its coding assistant to support GPT‑5.2 for orchestrated software builds conductor update; and Rork and HyperbookLM are likewise leaning on 5.2 for UI understanding and web‑scale research flows. (rork adoption, hyperbooklm plan)

For teams building internal “AI workbenches” this wave shows that GPT‑5.2 is becoming the default high‑end option in off‑the‑shelf tools your staff might adopt on their own, whether for BI, code, or slide decks. The implication is simple: even if you’re not ready to move your own stack to 5.2 yet, your power users may already be experiencing its capabilities elsewhere, which will shape their expectations for latency, context length, and output quality.
🛡️ Safety, robustness and policy moves
Safety/system notes and policy steps surfaced alongside jailbreaks. Excludes GPT‑5.2 launch content except to reference the new System Card metrics.
GPT-5.2 System Card shows big drops in deception and hallucinations
OpenAI’s GPT‑5.2 System Card reports that production “deception” — cases where the model lies about using tools or misrepresents its process — has fallen from 7.7% with GPT‑5.1‑Thinking to 1.6%, and adversarial deception in red‑team setups from 11.8% to 5.4%. system card summary It also shows modest but real hallucination gains: with browsing on, incorrect claims drop to 0.8% and answers with at least one major factual error fall to 5.8% vs 8.8% for GPT‑5.1‑Thinking. hallucination chart
The same document highlights stronger guardrails on sensitive content: mental‑health and emotional‑reliance benchmarks improve substantially (e.g. mental‑health safety score 0.915 vs 0.684 for 5.1‑Thinking), while the Preparedness team now categorizes GPT‑5.2‑Thinking as “High capability” for biological and chemical domains and activates the associated safeguards. mental health table Following up on earlier warnings about rising cyber capability in GPT‑5.x cyber safeguards, OpenAI also notes GPT‑5.2‑Thinking now tops internal AI‑self‑improvement tests such as reviewing OpenAI pull requests and matches GPT‑5.1‑Codex‑Max on MLE‑Bench while staying slightly behind it on PaperBench. system card summary For builders, the message is: 5.2’s safety work is focused on lying less about its own behavior, being pickier about risky content, and being more transparent about its own limitations, while still inching closer to “research engineer” capability on code and math. The trade‑off to watch is that stronger protections and higher reasoning efforts may make some workflows slower or more frequently refused, so teams should re‑run safety and reliability checks on their own prompts rather than assuming 5.2 will behave like 5.1.
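A minimal way to act on that advice is a replay harness over your own prompt set, diffing behavior between the old and new models before flipping defaults; the model IDs, prompt file, and crude refusal heuristic below are placeholders for whatever evals you already run:

```python
# Replay your own sensitive/critical prompts against the old and new models
# and diff behavior before flipping defaults. Model IDs, the prompt file, and
# the refusal heuristic are placeholders for your existing eval setup.
from openai import OpenAI

client = OpenAI()
prompts = [p.strip() for p in open("regression_prompts.txt") if p.strip()]

def run(model: str) -> list[str]:
    return [client.responses.create(model=model, input=p).output_text for p in prompts]

old, new = run("gpt-5.1"), run("gpt-5.2")

def refused(text: str) -> bool:
    return "can't help" in text.lower()  # crude proxy; swap in your own grader

for prompt, a, b in zip(prompts, old, new):
    if refused(a) != refused(b):
        print("behavior changed:", prompt[:80])
```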
GPT‑5.2 jailbreaks resurface, and OpenAI starts emailing enforcement warnings
A prominent red‑teamer showed that GPT‑5.2 can still be pushed into highly dangerous territory with an elaborate leetspeak prompt, eliciting long, detailed outputs on illicit drug synthesis, ricin extraction, anthrax aerosolization, and ransomware‑style encryptors when run in the Playground with reasoning turned up. (jailbreak thread, playground jailbreak)
Shortly afterward, the same user posted an email from OpenAI flagging their traffic as “sophisticated biological research” and reminding them that weapons‑related use is restricted, advising them to implement safety identifiers or risk enforcement up to account suspension. policy warning email This pairing — striking jailbreak success plus quick policy response — underlines the current reality: guardrails are much stronger than a year ago, but they remain probabilistic, and OpenAI is leaning more on monitoring and contract enforcement when users deliberately try to train or probe harmful behavior at scale.
For engineers doing red‑teaming or synthetic‑dataset work, the takeaway is to treat these experiments like sensitive security testing: keep them clearly scoped, document intent, and build automated filters around dangerous generations rather than stockpiling raw model outputs. For policy leads, this is an early signal of a more assertive enforcement regime as frontier models approach higher‑risk capability thresholds in the Preparedness framework. system card summary
OpenAI plans ChatGPT “adult mode” with age‑prediction‑based gating in 2026
OpenAI says a dedicated “adult mode” for ChatGPT is expected in Q1 2026, but only after it is confident in a new age‑prediction model that automatically tightens protections for likely‑under‑18 accounts. adult mode article The company is already testing this predictor in some countries; it will apply stronger filters to categories like gore, sexual or romantic role‑play, and other sensitive themes for younger users, while keeping the core policy on disallowed content unchanged. system card summary This approach shifts safety from pure self‑report or parental controls toward model‑side inference of age, which is powerful but also raises obvious fairness and error‑handling questions. Commentators note the move has been a long‑standing request (“definitely need some sort of ‘grown up mode’” altman quote) and see it as a compromise between one‑size‑fits‑all rules and a completely unfiltered “grown‑up” chat mode. followup comment For teams deploying ChatGPT or API‑backed assistants into education, consumer apps, or family products, this is a heads‑up that content behavior may diverge more sharply by user profile in 2026; it’s worth planning now for audits, appeals, and clear UX around “why” a response was limited.
Nvidia prepares on‑device location verification to curb AI GPU smuggling
Nvidia has developed a new location‑verification system for its datacenter GPUs that uses hardware telemetry and confidential‑computing techniques to prove where a chip is physically running, without giving Nvidia or anyone else the ability to remotely disable it. location verification The feature will debut on Blackwell‑generation parts and may roll back to Hopper and Ampere, and is explicitly framed as a response to export‑control evasion — coming as reports allege Chinese labs like DeepSeek have been smuggling restricted Blackwell chips via “phantom” overseas data centers and hand‑carried returns to China. (smuggling claims, nvidia denial) Following up on earlier stories about porous export controls and alleged smuggling to fuel frontier model training export controls, this is a clear sign that enforcement is moving from paperwork to cryptographic attestation at the silicon level. For infra and policy teams, it raises two practical questions: how to integrate location proofs into cloud and sovereignty guarantees for customers, and how to reconcile verifiable chip geography with multi‑jurisdiction data‑residency promises. It also hints at future norms where regulators demand evidence that specific high‑end accelerators are only being used where licenses allow, rather than trusting self‑reported inventories.
US executive order aims for a single national AI framework instead of 50 state laws
President Trump signed an executive order directing the US to adopt a single federal AI regulatory framework rather than a patchwork of state‑level rules, a move some founders immediately called a “big win for AI developers.” EO clip Supporters argue that one national standard will cut compliance overhead, speed up deployment, and make it easier to ship AI features without tracking dozens of conflicting state mandates. policy summary This sits alongside earlier moves like the defense bill that forces the Pentagon to explicitly plan for AGI scenarios AGI policy, signaling that Washington is trying to centralize AI governance while also hardening security planning. For AI leaders, the order doesn’t instantly answer what the federal rules are, but it does clarify who will define them — and that future policy fights will shift to DC rather than statehouses. Teams shipping consumer or enterprise AI into regulated verticals (health, finance, education) should expect a medium‑term wave of federal rulemaking and certification schemes, not 50 subtly different state‑level AI acts.
🎬 Generative media and vision: video tools and pipelines
A sizable creative/media cluster today: Runway Gen‑4.5 goes wide with audio and multi‑shot; world‑model work, avatar/video enhancers, and side‑by‑side model comparisons. Business aspects (Disney) are covered elsewhere.
Runway Gen‑4.5 goes live as a physics‑aware “world engine” for video
Runway’s Gen‑4.5 model is now in users’ hands, and the early clips look like a genuine step up in physically coherent, cinematic video—think low tracking shots following a rat through NYC streets or dragons flying over Westeros with stable composition and lighting. street rat demo Runway’s team frames Gen‑4.5 as a world‑simulation engine rather than “just” a text‑to‑video model, and they say it’s now running at scale on CoreWeave’s GB300 NVL72 systems, with monitoring via Weights & Biases Models. runway interview

For builders, this means you can lean harder on Gen‑4.5 for things that used to break older models: consistent character motion over several seconds, camera moves that respect scene geometry, and effects like water, smoke, or crowds that don’t stutter frame‑to‑frame. The W&B integration and mention of W&B Inference support hint that Gen‑4.5 is being treated as infrastructure: you can expect proper observability, experiment tracking, and repeatable runs rather than one‑off “hero shots.” runway interview If you’re prototyping storyboards or short sequences today, it’s worth re‑benchmarking against any older Runway or Veo pipelines you were using—especially for shots that combine moving cameras with lots of environmental chaos.
Fal hosts Creatify Aurora for single‑image talking avatars with rich motion
Fal has onboarded Creatify Aurora, an image‑to‑avatar model that turns a single reference photo plus a voice clip into a speaking, moving character with surprisingly expressive facial animation. aurora launch The launch thread pitches it as “state‑of‑the‑art image‑to‑avatar,” tuned for ultra‑realistic head motion and lip sync while staying consistent across different avatar styles and backgrounds.

For product teams, this slots neatly into explainer videos, customer‑support avatars, and training content where you want a persistent on‑screen host without running a full character‑rig pipeline. Because it’s exposed via Fal’s API, fal model card you can wire Aurora into an existing workflow: generate a script with an LLM, synthesize VO with your TTS of choice, then feed both into Aurora to get a video segment. The selling point isn’t just talking heads; it’s that the avatars preserve subtle expression and timing well enough that you could, in principle, patch them into highly produced pieces without an obvious quality cliff.
Fal ships Wan Vision Enhancer and Flux upscaler for smarter video and image upgrades
Fal introduced Wan Vision Enhancer, a video enhancer built on top of the Wan model that can take noisy or low‑quality clips and rebuild them into crisp, detailed footage. wan enhancer post The tool offers five “creativity” levels, from faithful restoration through increasingly stylized reinterpretations, so you can choose between documentary‑style cleanup and more artistic outputs, wan enhancer page and a follow‑up release added a Flux‑based creative upscaler for still images. flux upscaler demo

Together, these tools start to look like a full enhancement stack: you can run archival or user‑generated video through Wan for de‑noise + detail, then feed key frames into the Flux upscaler when you need hero stills for thumbnails or marketing. flux upscaler demo For anyone building editing suites, UGC platforms, or synthetic‑data pipelines, it’s a way to outsource the gnarly model‑selection problem and give users a single slider ("creativity level") while Fal handles whether the underlying run is closer to restoration or re‑generation. The main trade‑off to watch is cost at higher creativity settings, where the model is doing more than simple super‑resolution.
Gemini 3 Pro outperforms GPT‑5.2 on detailed motherboard understanding
A side‑by‑side test of GPT‑5.2 and Gemini 3 Pro on labeling components of a PC motherboard shows Gemini still has a clear edge on fine‑grained multimodal understanding. comparison thread In Logan Kilpatrick’s example, Gemini correctly identifies ports, headers, chipset, CMOS battery, and various USB stacks with tight bounding boxes, gemini multimodal praise while screenshots shared by another user show GPT‑5.2 missing or mislabeling several of the same elements.
For engineers deciding which model to trust on UI screenshots, hardware diagrams, or other dense visual layouts, this is a useful datapoint: GPT‑5.2 appears to have caught up with or surpassed Gemini on many text‑only benchmarks, but Gemini 3 Pro is still the safer default when the task is “tell me what each of these ports and chips is.” gemini multimodal praise It also reinforces a broader pattern emerging in public evals: OpenAI is leading on abstract and long‑context reasoning, while Google’s Gemini stack keeps a narrow but meaningful lead on pixel‑level multimodal perception. If your product leans heavily on screenshot interpretation, CAD overlays, or electronics documentation, it’s worth running your own small battery of tests like this before standardizing on one vendor.
Google Labs’ Pomelli adds Animate, using Veo 3.1 to turn static designs into motion
Google Labs quietly rolled out an “Animate” feature in its Pomelli experiment that can take static marketing or product layouts and generate on‑brand motion using Veo 3.1 under the hood. pomelli animate post The feature is currently free in the US, Canada, Australia and New Zealand, and the demo shows Pomelli ingesting a still graphic and producing a short, polished animation that preserves layout and style while adding camera moves and element motion.

For teams already prototyping landing pages or social assets in Pomelli, this shaves off a whole tool from the stack: instead of exporting your design into a separate video editor or Veo‑based script, you can press a single button and get a motion‑ready version that still feels on brand. Because it’s built on Veo 3.1, you should expect better temporal coherence and fewer “jump cuts” than older text‑to‑video tools, especially for UI‑like content with crisp lines and typography. The downside: it’s an experiment, not a GA API, so this is more for marketing and content folks than for deep integration into production pipelines—for now.
OmniPSD generates fully layered PSDs directly from prompts with a diffusion transformer
OmniPSD is a new research‑grade system that uses a diffusion transformer to generate not just flat images, but full layered PSD files straight from a text prompt. omnipsd thread The demo shows a PSD opening in Photoshop with each object—background, shapes, text, decorations—on its own editable layer, all produced in one go by the model rather than via post‑hoc segmentation.

This is a big deal for design‑tool pipelines: instead of generating a raster image and then trying to reverse‑engineer layers, you can imagine an AI “first draft” of a social graphic, app mock, or poster that you immediately tweak in your usual Photoshop or Figma workflow. The paper and project page discuss how the model learns to emit structured layer stacks and masks, paper page which opens the door to automation like “make five PSD variations of this hero banner, each with a different product shot and background treatment.” Short term, this will stay in research and enthusiast workflows, but it points at where pro tools will go: models that operate natively over your file formats instead of handing you a flattened composite.
StereoWorld turns monocular videos into geometry‑aware stereo for 3D viewing
StereoWorld is a geometry‑aware monocular‑to‑stereo video generator that converts a single‑camera clip into a stereo pair suitable for VR or 3D displays. stereoworld demo The teaser shows a live‑action scene replayed with parallax between left and right views, suggesting the model has learned a reasonably accurate depth field and can maintain temporal consistency over multiple frames.

For teams working on immersive playback—headsets, 3D monitors, or even glasses‑free displays—this kind of model can unlock a huge back catalog of “flat” content. You could imagine a pipeline where you ingest legacy video, run it through StereoWorld to synthesize disparity, then feed the result into your stereoscopic renderer, all without per‑clip artist intervention. paper page The catch, as always with depth‑from‑video models, will be edge cases: thin structures, reflections, fast motion and cuts. But if you’re already experimenting with depth‑estimation plus post‑processing, it’s worth looking at whether this end‑to‑end learned approach gives better results for the same compute budget.
Invideo debuts AI film tool that stylizes footage while preserving actor performance
Invideo announced a new AI‑powered pipeline that takes raw live‑action footage and transforms it into a polished, stylized “Hollywood‑style” film while explicitly preserving the original actor’s facial expressions, timing, and eye‑lines. invideo demo The demo shows a person acting in front of a simple background, then the same performance re‑rendered with cinematic lighting, set design, and color grading—but with the micro‑expressions and lip‑sync intact.

This sits somewhere between rotoscoping and full 3D re‑render: you still cast and direct real actors, but you can radically change the look and feel of the scene in post without losing the human performance. For small studios and creators, that means you can shoot on a bare‑bones set and treat style as a late‑binding choice—swap in different worlds, moods, or genres after you see what plays. The model’s ability to keep eye‑lines and body timing stable is the crucial bit; if it holds up across more than curated demos, it could become a practical tool for indie film and high‑end YouTube rather than a one‑off novelty.
🏗️ AI infra & networking economics
Fewer pure infra items today but notable signals on DC buildouts and research capacity. Excludes model launches and safety policy details tracked elsewhere.
Broadcom CEO reveals $73B AI data‑center networking backlog
Broadcom’s CEO says the company has about $73B of AI data‑center orders in backlog, expected to ship over the next ~18 months, with customers increasingly focused on “wiring up a whole data center like a giant machine” rather than just buying more GPUs. broadcom backlog clip This underlines that networking, optics, and system‑level design (fabric that makes many GPUs act as one) are now major bottlenecks and profit centers in AI infra, not just the accelerators themselves. For infra leads and founders, the signal is that large‑scale AI capacity will continue to arrive in lumpy, capex‑driven waves governed by switch/optics availability and integration timelines, not just model demand.
Nvidia plans GPU location‑verification to curb export‑control evasion
Nvidia is developing a location verification system for its datacenter GPUs that uses hardware telemetry and confidential computing to infer where a chip is physically running, aiming to deter smuggling into banned markets like China without adding a remote kill switch. location verification summary The feature will debut on Blackwell and may extend to Hopper/Ampere, giving regulators and OEMs a stronger compliance story after reports of illicit Blackwell use, while still trying to reassure customers that neither Nvidia nor governments can unilaterally brick deployed hardware. For anyone planning global AI clusters, this points to tighter geo‑fencing at the silicon layer and more operational friction for gray‑market capacity.
📚 Fresh research: agent scaling laws, diffusion LLMs, code‑from‑papers
Strong paper drop for builders: quantitative guidance for multi‑agent systems, evaluation and decoding for diffusion LLMs, single‑call KG reasoning, interpretability, and turning papers into working code.
Scaling laws for multi‑agent systems show small average gains, big variance
A new Google Research / DeepMind / MIT paper, Towards a Science of Scaling Agent Systems, finds that adding more LLM agents yields a mean −3.5% impact across 4 benchmarks, with outcomes ranging from +81% improvement to −70% degradation depending on task and architecture. paper summary The work systematically compares single‑agent baselines to four multi‑agent setups (independent, centralized, decentralized, hybrid) over 180 configurations.
Key findings that should change how you design agent stacks:
- Tool‑coordination trade‑off: tool‑heavy tasks suffer from coordination overhead; on a task with 16 tools, even the best multi‑agent setup underperforms a single strong agent. paper summary
- Capability ceiling: once a single agent baseline passes ~45% accuracy, extra agents usually provide diminishing or negative returns, because coordination tax dominates.
- Error amplification: independent agents amplify mistakes 17.2×, while centralized coordination (a manager agent validating work) contains that to 4.4×, so topology directly shapes reliability.
- Task‑dependent wins: centralized teams give +80.9% on parallelizable financial reasoning, small gains on dynamic web navigation, but degrade sequential planning by 39–70% when constraints require tight global consistency. results summary
- They fit a predictive model (cross‑validated R²≈0.51) that picks the best architecture in 87% of held‑out cases, based only on task decomposability, tool complexity, and baseline difficulty. ArXiv paper
For builders, the point is: don’t reflexively reach for “more agents”. Start with a strong single agent plus tools, measure baseline accuracy, then consider centralized coordination only for clearly parallel subtasks with moderate tool counts (a rough routing sketch follows below). This paper gives you quantitative backing when you argue against over‑engineered multi‑agent mazes and for leaner designs that spend context and latency where it actually buys capability.
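As a concrete illustration, the heuristic below encodes the reported thresholds (the ~45% single‑agent ceiling, the 16‑tool coordination example, task decomposability) into a simple routing rule. It is not the paper's fitted predictor, just a sketch of how its qualitative findings could drive an architecture choice.

```python
def pick_agent_architecture(baseline_accuracy: float,
                            num_tools: int,
                            decomposable: bool) -> str:
    """Coarse recommendation: 'single' agent or 'centralized' multi-agent."""
    if baseline_accuracy >= 0.45:
        # Past the reported capability ceiling, extra agents mostly add coordination tax.
        return "single"
    if num_tools >= 16:
        # Tool-heavy tasks: coordination overhead dominated even the best multi-agent setup.
        return "single"
    if decomposable:
        # Parallelizable subtasks with moderate tool counts saw the biggest gains
        # from a manager agent validating worker outputs (centralized topology).
        return "centralized"
    return "single"

# Example: a decomposable financial-reasoning task with a weak single-agent baseline
print(pick_agent_architecture(baseline_accuracy=0.30, num_tools=6, decomposable=True))
# -> centralized
```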
d3LLM uses new AUP metric to push diffusion LLMs up to 10× faster
Hao AI Lab introduced AUP (accuracy under parallelism) as a unified way to judge diffusion language models, then used it to build d3LLM, a diffusion LLM framework that hits up to 10× speedup over prior diffusion models like LLaDA/Dream and about 5× faster decoding than comparable autoregressive (AR) baselines (Qwen‑2.5‑7B) on H100, at near‑matched accuracy. aup overview Instead of reporting tokens‑per‑forward or accuracy in isolation, AUP treats speed and quality as a single trade‑off curve:
- AUP is hardware‑agnostic, using tokens‑per‑forward (TPF) plus task accuracy so you can compare diffusion vs AR vs speculative decoding fairly across GPUs. blog post
- When they re‑benchmark existing diffusion LLMs, nearly all sit on a clear speed–accuracy frontier: aggressive speedups often cost significant accuracy, and older claims of “parallel and equally good” don’t hold up under AUP. benchmark recap
- AR models with speculative decoding remain very strong once you account for both dimensions; many diffusion models only look good because prior work underweighted accuracy. metric explanation
d3LLM then pushes that frontier:
- On 9/10 tasks, d3LLM gets the highest AUP among diffusion models, with negligible accuracy degradation vs their AR baselines. d3llm results
- Techniques include pseudo‑trajectory distillation (≈15% TPF gain), curriculum over noise/window length (≈25%), and an entropy‑aware multi‑block decoder with KV‑cache refresh (≈20% more TPF), which collectively make parallel decoding actually pay off. training tricks
For infra and model‑platform teams, this is a signal that diffusion LLMs aren’t hype, but they only win when you optimize both the decoding algorithm and evaluation metric. If you’re experimenting with non‑AR stacks, AUP gives you a way to compare them honestly against your best AR+spec pipeline rather than cherry‑picking speed or accuracy alone (a minimal frontier sketch follows below).
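The exact AUP formula isn't reproduced in the posts above, but the bookkeeping it formalizes is easy to sketch: measure (TPF, accuracy) per decoding configuration and keep only the non‑dominated ones. The numbers below are made‑up placeholders, not d3LLM results.

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    tpf: float       # tokens decoded per forward pass (hardware-agnostic speed proxy)
    accuracy: float  # task accuracy under that decoding setting

# Placeholder numbers for illustration only.
configs = [
    Config("AR + speculative decoding", tpf=3.2, accuracy=0.78),
    Config("diffusion, conservative",   tpf=4.0, accuracy=0.74),
    Config("diffusion, aggressive",     tpf=9.5, accuracy=0.61),
    Config("older diffusion baseline",  tpf=3.0, accuracy=0.60),
]

def pareto_frontier(points: list[Config]) -> list[Config]:
    """Keep configs not dominated on both speed (TPF) and accuracy by another config."""
    def dominated(p: Config) -> bool:
        return any(q.tpf >= p.tpf and q.accuracy >= p.accuracy
                   and (q.tpf > p.tpf or q.accuracy > p.accuracy)
                   for q in points)
    return [p for p in points if not dominated(p)]

for c in pareto_frontier(configs):
    print(f"{c.name}: {c.tpf} TPF at {c.accuracy:.0%} accuracy")
# "older diffusion baseline" drops out; the other three trade speed for accuracy.
```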
DeepCode agent rebuilds code from papers and edges out PhD baselines
The DeepCode paper proposes an agentic coding pipeline that reads research papers and reconstructs working codebases, scoring 73.5% on the PaperBench benchmark—slightly above a 72.4% average from top PhD researchers asked to do the same tasks. deepcode summary PaperBench measures whether systems can turn real ML/optimization papers into runnable implementations, not just answer Q&A.
DeepCode treats the problem as information‑flow engineering rather than a single giant prompt:
- It first parses each paper into a blueprint: file layout, key functions, dependencies, and evaluation protocol, then uses that to guide generation instead of free‑forming an entire repo from scratch. deepcode summary - As it writes code, it maintains a compressed code memory of public interfaces and types so later files don’t drift and regress earlier work.
- When the paper elides engineering details (e.g., logging, training loops, data loaders), the agent does retrieval over real Git repos to import idiomatic scaffolding, instead of hallucinating boilerplate. ArXiv paper
- A final stage compiles and runs tests in a sandbox, reads error logs, and edits only the lines implicated by failures, iterating until the project passes or a budget is reached.
For tool builders, this is a concrete recipe for paper‑to‑prototype agents inside your org: replace arXiv PDFs with internal design docs or RFCs, swap in your own codebase for retrieval, and keep the same blueprint→memory→patch loop. It also quietly raises the bar for evals—if your coding agent only shines on leetcode‑style problems, PaperBench‑style tasks will reveal that gap fast.
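A structural sketch of that blueprint→memory→patch loop is below. The LLM, retrieval, and sandbox calls are replaced with trivial stand‑ins so the control flow runs end to end; it mirrors the described stages, not DeepCode's actual implementation.

```python
def parse_blueprint(paper_text: str) -> dict:
    # Real system: an LLM extracts file layout, key functions, deps, and eval protocol.
    return {"files": ["model.py", "train.py"], "eval_protocol": "pytest -q"}

def retrieve_scaffolding(path: str) -> str:
    # Real system: retrieval over public Git repos for idiomatic boilerplate.
    return f"# scaffolding for {path}\n"

def generate_file(path: str, blueprint: dict, memory: dict, scaffold: str) -> str:
    # Real system: an LLM writes the file, conditioned on the blueprint + interface memory.
    return scaffold + f"def {path.removesuffix('.py')}_entrypoint():\n    return 'ok'\n"

def summarize_interfaces(source: str) -> str:
    # Real system: a compressed record of public interfaces/types to prevent drift.
    return "\n".join(line for line in source.splitlines() if line.startswith("def "))

def run_tests_in_sandbox(repo: dict, eval_protocol: str) -> dict:
    # Real system: compile and run tests, returning failures plus blamed lines.
    return {"passed": True, "blamed_lines": {}, "logs": ""}

def patch_lines(source: str, lines: list, logs: str) -> str:
    # Real system: an LLM edits only the implicated lines based on error logs.
    return source

def build_codebase_from_paper(paper_text: str, max_repair_iters: int = 10) -> dict:
    blueprint = parse_blueprint(paper_text)
    interface_memory: dict = {}
    repo: dict = {}
    for path in blueprint["files"]:
        scaffold = retrieve_scaffolding(path)
        repo[path] = generate_file(path, blueprint, interface_memory, scaffold)
        interface_memory[path] = summarize_interfaces(repo[path])
    for _ in range(max_repair_iters):
        report = run_tests_in_sandbox(repo, blueprint["eval_protocol"])
        if report["passed"]:
            break
        for path, lines in report["blamed_lines"].items():
            repo[path] = patch_lines(repo[path], lines, report["logs"])
    return repo

print(list(build_codebase_from_paper("...").keys()))  # -> ['model.py', 'train.py']
```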
Co‑evolution of swarm algorithms and prompts boosts LLM‑designed solvers
A new paper on LLM‑driven algorithm design shows that co‑evolving both swarm optimizers and their prompts can more than double performance on hard combinatorial tasks. On a single‑runway aircraft landing scheduling benchmark, their method averages 105.34% of a strong baseline’s score, compared to 56.04% for prior LLM‑designed search methods like ReEvo and FunSearch using GPT‑4o‑mini. paper explainer Rather than treating the LLM as a one‑shot code generator wrapped in a fixed prompt, they use a Fireworks Algorithm loop:
- Each iteration, the LLM edits a single operator in the swarm optimizer (e.g., mutation rule) or mixes two existing algorithms, generating many nearby candidates whose performance is evaluated on the task. method overview - In parallel, the LLM also rewrites prompt templates—reusable instruction blocks that shape how future code is produced—with each template scored by the uplift it gives to the next algorithm variant.
- Selection favors candidates that are both high‑performing and diverse, preserving multiple promising lines of search rather than collapsing to one local optimum. paper explainer
- Across four hard combinatorial problems, the co‑evolution approach consistently outperforms baselines and remains strong across three model families (GPT‑4o‑mini, Qwen3‑32B, GPT‑5), suggesting it’s not tied to a single frontier LLM. ArXiv paper
For anyone building auto‑tuning or “AI that writes algorithms”, the message is: don’t freeze the prompt. Treat the textual scaffolding around your code as part of the search space, and couple it with structured operators over code (like specific optimizer rules) rather than diff‑everything rewriting. This paper gives a recipe for doing that in a way that’s measurable and robust instead of vibes‑driven.
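Here is a compact, runnable caricature of that co‑evolution loop: one population of optimizer variants and one of prompt templates evolve together, with templates scored by the uplift seen in the algorithms produced under them. The LLM calls and the scheduling evaluator are stubbed with placeholders; the paper's actual Fireworks Algorithm machinery is richer than this.

```python
import random

def llm_edit_operator(algorithm: str, template: str) -> str:
    # Real system: the LLM edits a single operator (e.g. a mutation rule) or mixes
    # two existing algorithms, guided by the current prompt template.
    return f"{algorithm}+edit"

def llm_rewrite_template(template: str) -> str:
    # Real system: the LLM rewrites the reusable instruction block itself.
    return f"{template}+tweak"

def evaluate(algorithm: str) -> float:
    # Real system: run the generated optimizer on the scheduling benchmark.
    return random.random()

def coevolve(generations: int = 5, children: int = 4):
    algorithms = [("baseline_swarm", evaluate("baseline_swarm"))]
    templates = [("base_prompt", 0.0)]
    for _ in range(generations):
        parent_alg, parent_score = max(algorithms, key=lambda a: a[1])
        parent_tpl, _ = max(templates, key=lambda t: t[1])
        for _ in range(children):
            child = llm_edit_operator(parent_alg, parent_tpl)
            score = evaluate(child)
            algorithms.append((child, score))
            # Score the rewritten template by the uplift observed around it (a simplification).
            templates.append((llm_rewrite_template(parent_tpl), score - parent_score))
        # Keep the top-k of each population (the paper additionally preserves diversity here).
        algorithms = sorted(algorithms, key=lambda a: a[1], reverse=True)[:children]
        templates = sorted(templates, key=lambda t: t[1], reverse=True)[:children]
    return max(algorithms, key=lambda a: a[1])

print(coevolve())
```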
PathHD shows single‑call QA over knowledge graphs with hypervectors
The PathHD work proposes a hyperdimensional representation scheme that lets you answer knowledge‑graph (KG) questions with a single LLM call, instead of multi‑step path enumeration plus per‑path scoring. It encodes each relation as a random high‑dimensional vector, composes candidate paths into hypervectors, and ranks them by similarity to a question embedding before handing only the top few to an LLM for answer generation. pathhd summary The design gives you most of the interpretability of path‑based systems with latency closer to plain RAG:
- Every KG relation gets a fixed hypervector; composing them (using order‑sensitive binding) yields a unique vector per path, preserving directionality so A→B→C differs from C→B→A. ArXiv paper
- At query time, a light encoder turns the question into a vector, then cosine similarity over pre‑computed path vectors selects a small list of high‑probability reasoning chains.
- A single LLM call then sees the question plus that ranked evidence set, and is prompted to both answer and cite which path(s) it used, keeping some of the auditability people like from multi‑call pipelines.
- Experiments on KG QA benchmarks (e.g., WebQSP, CWQ, GrailQA) show accuracy on par with heavier multi‑call systems, while cutting both latency and GPU memory by avoiding repeated LLM scorers. pathhd summary
If you maintain a graph store (customer 360, fraud, supply chain), PathHD is a good blueprint for how to bolt on path‑aware QA without exploding your prompt budget. Pre‑compute and cache the path hypervectors, keep a small, deterministic ranking pass, and reserve LLM tokens for one well‑framed query instead of dozens of nearly identical scoring calls; the sketch below shows the core encoding trick.
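A minimal sketch of the encoding idea in NumPy: each relation gets a fixed random hypervector, paths are bound with a position‑dependent permutation so direction matters, and candidates are cosine‑ranked against a question vector. The cyclic‑shift binding and the assumption that the question embedding lives in the same space are illustrative simplifications, not details taken from the PathHD paper.

```python
import numpy as np

DIM = 10_000
rng = np.random.default_rng(0)
relation_hv = {}  # relation name -> fixed random hypervector

def hv(relation: str) -> np.ndarray:
    if relation not in relation_hv:
        relation_hv[relation] = rng.choice([-1.0, 1.0], size=DIM)
    return relation_hv[relation]

def encode_path(relations: list[str]) -> np.ndarray:
    """Bind relations with a position-dependent shift so A->B->C differs from C->B->A."""
    out = np.ones(DIM)
    for i, rel in enumerate(relations):
        out *= np.roll(hv(rel), i)  # elementwise binding plus positional permutation
    return out

def rank_paths(question_vec: np.ndarray, paths: dict[str, list[str]], top_k: int = 3):
    """Cosine-rank pre-computed path hypervectors against a question embedding.

    In the real system the question vector comes from a light encoder; here it is
    just assumed to live in the same hypervector space.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(name, cos(question_vec, encode_path(rels))) for name, rels in paths.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Directionality check: the same relations in reverse order give a nearly orthogonal vector.
fwd = encode_path(["founded", "acquired"])
rev = encode_path(["acquired", "founded"])
print(float(fwd @ rev) / DIM)  # close to 0
```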
🗣️ Voice ecosystems: platform reach and cost calculus
Smaller but relevant voice updates: a massive distribution partnership, creative singing demos, and concrete cost math for realtime conversational apps.
ElevenLabs audio now powers Instagram, Horizon and more
ElevenLabs announced a major distribution partnership with Meta: its text‑to‑speech, dubbing and music models will power expressive audio across Instagram, Horizon and other Meta products, exposing 11,000+ voices in 70+ languages to billions of users Meta partnership. This effectively makes ElevenLabs’ stack a default audio layer for a huge chunk of the consumer internet, with use cases from auto‑dubbing Reels into local languages to generating character voices and music in Horizon for creators and brands.
For AI builders, this means Meta’s consumer surfaces will increasingly expect high‑quality synthetic audio and localization by default, not as an add‑on. If you already publish to Instagram or build Horizon experiences, you should assume multi‑language voice and character voices are now table stakes; if you run your own stack, this also hints at the bar you need to hit on voice diversity, latency and licensing to stay competitive.
Gemini Live two‑way voice runs around 1–2 cents per minute
A back‑of‑the‑envelope breakdown from a Google engineer shows that Gemini Live—the real‑time conversational mode in Google AI Studio—lands at roughly $0.012–$0.02 per minute for a typical two‑way conversation at current prices cost math thread. They estimate about 32 input tokens/second billed at $3 per million (≈$0.0057/min of listening) and 25 output tokens/second billed at $12 per million (≈$0.018/min of speaking), so a balanced call averages just over a penny per minute.
That puts a 10‑minute live call in the ~$0.10–0.20 range, which is low enough that many teams can treat voice UX as a default instead of a premium feature, especially when paired with Cloud Run’s generous free tier for small apps deployment cost note. Latency and pipeline complexity still matter—stitched‑together ASR + LLM + TTS stacks have higher coordination overhead latency notes—but this pricing makes it easier to justify real‑time agents for support, education or lightweight coaching use cases without blowing through budget.
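The arithmetic is easy to sanity‑check; the snippet below reproduces the thread's numbers, assuming a "balanced" call means roughly half the time listening and half speaking.

```python
# Back-of-the-envelope Gemini Live cost math, using the rates cited in the thread.
INPUT_TOKENS_PER_SEC = 32     # listening
OUTPUT_TOKENS_PER_SEC = 25    # speaking
INPUT_PRICE_PER_M = 3.00      # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 12.00    # USD per 1M output tokens

listen_cost_per_min = INPUT_TOKENS_PER_SEC * 60 / 1e6 * INPUT_PRICE_PER_M    # ~$0.0058
speak_cost_per_min = OUTPUT_TOKENS_PER_SEC * 60 / 1e6 * OUTPUT_PRICE_PER_M   # ~$0.0180
balanced_per_min = 0.5 * listen_cost_per_min + 0.5 * speak_cost_per_min      # ~$0.012 (assumed 50/50 split)

print(f"listening: ${listen_cost_per_min:.4f}/min, speaking: ${speak_cost_per_min:.4f}/min")
print(f"balanced call: ${balanced_per_min:.4f}/min, 10-minute call: ${10 * balanced_per_min:.2f}")
```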
Gemini speech models show flexible singing and playful audio
New demos highlight that Google’s latest Gemini speech models can handle surprisingly rich musical and conversational cues, not just plain narration. In one clip, a "friendly woman" voice sings Happy Birthday to a cat, pauses mid‑song to comment, then resumes in the right place while keeping timing and emotional tone intact birthday singing demo. In another, the same system improvises two short original “ditties” about buying milk and building with AI, with melody and rhythm even though there’s no known tune to copy two ditties demo.


For voice‑product teams this shows Gemini’s TTS is viable not only for assistants and IVRs but also for playful, partially sung UX—e.g. micro‑jingles, in‑app celebrations or character performances—without pre‑baked songs. The catch is that control is prompt‑driven rather than parameter‑driven, so you’ll want to prototype guardrails around when the model is allowed to “get musical” versus staying strictly informational.