AI Primer engineer report: OpenAI GPT‑5.2 multiplies tokens 6× on agents – 1T‑token debut – Fri, Dec 12, 2025

OpenAI GPT‑5.2 multiplies tokens 6× on agents – 1T‑token debut


Executive Summary

Two days after OpenAI’s GPT‑5.2 launch, independent evals are filling in the fine print. On Artificial Analysis’ GDPval‑AA run, GPT‑5.2 xhigh tops the agentic leaderboard at 1474 ELO, edging Claude Opus 4.5 at 1413, but burns ~250M tokens and ~$610 across 220 tasks—over 6× GPT‑5.1’s usage.

ARC‑AGI‑2 tells the same story: GPT‑5.2 high reaches ~52.9% at ~$1.39 per problem vs GPT‑5’s 10% at $0.73, and xhigh and Pro runs go higher still. Meanwhile on SimpleBench’s trick questions, GPT‑5.2 base scores 45.8% and Pro 57.4%, trailing GPT‑5 Pro and far behind Gemini 3 Pro at 76.4%, reinforcing that “more thinking” mainly helps long, structured work.

Usage doesn’t seem scared off: Sam Altman says GPT‑5.2 cleared 1T API tokens on day one despite roughly 40% higher list prices. VendingBench‑2 and Epoch’s ECI put it in the same long‑horizon league as Gemini 3 Pro and Opus 4.5 rather than a runaway winner. The practical takeaway: treat xhigh reasoning like a surgical tool for quarterly plans, audits, and gnarly debugging, while cheaper, faster models handle your inner loops.

Feature Spotlight

Feature: GPT‑5.2 reality check—evals vs cost and latency

GPT‑5.2 leads GDPval‑AA but burns ~250M tokens and ~$608 per run; underperforms on SimpleBench and is slow at xhigh. Engineers must weigh accuracy gains against cost/latency for real agent workflows.

🧄 Feature: GPT‑5.2 reality check—evals vs cost and latency

Day‑two picture for GPT‑5.2: strong agentic task wins but mixed third‑party benchmarks and high token spend. Threads focus on SimpleBench underperformance, GDPval-AA costs, and heavy xhigh reasoning latency for AGI‑style gains.

GPT‑5.2 tops GDPval‑AA ELO but at ~250M tokens for 220 tasks

Artificial Analysis’ GDPval‑AA run now has GPT‑5.2 (xhigh) at the top of its agentic knowledge‑work leaderboard with an ELO of 1474, beating Claude Opus 4.5 at 1413 and GPT‑5 (high) at 1305. gdpval summary This is a third‑party rerun of OpenAI’s GDPval dataset, following up on gdpval expert wins which covered OpenAI’s own 70.9% win‑or‑tie rate vs human professionals.

The win comes with a clear price tag: running GPT‑5.2 through all 220 agentic tasks cost about $608–$620 and used ~250M tokens, versus ~$88 and ~40M tokens for GPT‑5.1 on the same harness, implying >6× more token usage for the new model. (run cost tweet, token usage tweet) Most of that overhead comes from letting GPT‑5.2 use the xhigh reasoning effort setting inside Artificial Analysis’ Stirrup agent framework, which encourages long tool‑using chains for each task. stirrup github repo For teams, the takeaway is that GPT‑5.2 xhigh really does buy state‑of‑the‑art agentic performance on realistic business workflows, but only pencils out when each task is worth a few dollars of compute—more like quarterly planning decks and complex RFPs than everyday chat or light ETL.
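Back of the envelope, the per‑task economics implied by those figures look like this (a quick sketch using the approximate numbers above, not an official cost breakdown):

```python
# Rough per-task economics from the GDPval-AA figures quoted above
# (approximate numbers from the cited run, not an official breakdown).
TASKS = 220

gpt_52_total_usd, gpt_52_tokens = 610.0, 250e6   # ~$608-$620 and ~250M tokens for GPT-5.2 xhigh
gpt_51_total_usd, gpt_51_tokens = 88.0, 40e6     # ~$88 and ~40M tokens for GPT-5.1

print(f"GPT-5.2 xhigh: ~${gpt_52_total_usd / TASKS:.2f} and ~{gpt_52_tokens / TASKS / 1e6:.2f}M tokens per task")
print(f"GPT-5.1:       ~${gpt_51_total_usd / TASKS:.2f} and ~{gpt_51_tokens / TASKS / 1e3:.0f}k tokens per task")
# -> roughly $2.77 vs $0.40 per task, and ~1.1M vs ~180k tokens per task
```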

Xhigh reasoning boosts GPT‑5.2’s scores but raises per‑task cost and latency

The same high‑profile benchmarks that made GPT‑5.2 look like a monster also highlight how much extra compute it burns when you let it think in xhigh mode. On ARC‑AGI‑2, GPT‑5.2 (high) reaches ~52.9% at around $1.39 per task, roughly double the per‑task cost of GPT‑5, which scored only 10% at $0.73 per task, and comfortably above GPT‑5.1 high’s lower‑accuracy, lower‑cost point. arc cost commentary Xhigh and Pro runs cost even more.

Users are noticing. One builder notes that AI Explained’s run of GPT‑5.2 xhigh on SimpleBench burned roughly 100k “thinking” tokens per query, xhigh token note while another jokes that “gpt 5.2 xhigh might as well be AGI because by the time it finishes my query, AGI will be here.” latency joke Others accuse OpenAI of “benchmaxing”—raising prices by ~40%, then letting models think far longer on evals to win leaderboards, without surfacing cost and time alongside accuracy. benchmax rant All of this is the flip side of arc-agi gains, where GPT‑5.2 Pro (xhigh) hit 90.5% on ARC‑AGI‑1 and achieved a 390× efficiency gain over last year’s SOTA when you compare to the original $4,500 DeepSeek‑style runs. Yes, you can buy huge gains with more test‑time compute—but your own workloads will only benefit if you actually let the model spend tens of thousands of extra tokens per hard problem.

For teams, the practical response is to treat xhigh reasoning as a scalpel, not a default: reserve it for the few percent of tasks where a single wrong answer is very expensive, and instrument your pipelines so that you see not just accuracy but also tokens, dollars, and seconds per task.
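One lightweight way to get that visibility is a thin wrapper around each model call that records tokens, estimated dollars, and wall‑clock seconds per task. A minimal sketch; the pricing constants, client call, and usage field names are placeholders to adapt to your own SDK:

```python
import time
from dataclasses import dataclass

# Placeholder prices per 1M tokens; substitute your provider's actual rates.
PRICE_PER_M_INPUT = 1.75
PRICE_PER_M_OUTPUT = 14.00

@dataclass
class CallStats:
    task_id: str
    input_tokens: int
    output_tokens: int
    seconds: float

    @property
    def usd(self) -> float:
        return (self.input_tokens * PRICE_PER_M_INPUT
                + self.output_tokens * PRICE_PER_M_OUTPUT) / 1e6

def timed_call(task_id, client, **request):
    """Run one model call and capture tokens, dollars, and latency alongside the result."""
    start = time.monotonic()
    response = client.responses.create(**request)   # hypothetical client call; adapt to your SDK
    elapsed = time.monotonic() - start
    usage = response.usage                           # token usage fields vary by SDK
    stats = CallStats(task_id, usage.input_tokens, usage.output_tokens, elapsed)
    return response, stats
```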

Builders reserve GPT‑5.2 for deep audits while leaning on faster models day‑to‑day

A pattern is emerging in how experienced developers are actually slotting GPT‑5.2 into their stacks: it’s a specialist, not the default. One engineer’s “best coding model TL;DR” is: daily driver Claude Opus 4.5, GPT‑5.2 (high) for planning, audits and bug‑fix review, and GPT‑5.2 Pro for the very hardest problems via repo‑level prompts. coding model tldr

Another reports that GPT‑5.2 is “most used atm… painful slow, pragmatic approach & advice but accurate,” with Opus 4.5 as a fast daily driver, Composer‑1 as an ultra‑fast cheap helper, and Gemini 3 Pro reserved for UI design ideas. model pool comment A third says GPT‑5.2 “has more… personality,” pushes back more, and often one‑shots complex requests—but adds bluntly: “It is DEAD SLOW, I mitigate by working on many things at once.” slow but strong

Some people tried to replace Opus entirely and backed off. One radio‑firmware builder found scenarios where both Opus 4.5 and GPT‑5.2 failed, concluding after an hour‑long session, “i will not be cancelling my opus subscription… opus is just more interactive and faster at trying things.” (codex vs opus test, kept opus)

So the lived reality matches the benchmarks: GPT‑5.2 is excellent when you can afford to let it think—especially for whole‑repo reasoning, architecture reviews, or tricky planning—but for tight inner loops and interactive coding, many teams are sticking with faster, cheaper models and calling 5.2 only when the extra depth truly matters.

Sam Altman says GPT‑5.2 crossed 1T API tokens on day one

Sam Altman reports that GPT‑5.2 “exceeded a trillion tokens in the API on its first day of availability and is growing fast,” which is an enormous adoption spike for such a new and relatively expensive model. altman usage stat That implies developers immediately pushed large volumes of requests—probably a mix of agents, batch jobs, and upgraded production workloads—onto the new endpoints.

This surge arrives despite OpenAI increasing GPT‑5.2 prices by about 40% over GPT‑5.1 for the same token types and encouraging the use of heavy reasoning‑effort modes that further inflate usage. gdpval summary It lines up with economic work suggesting that demand for AI compute is highly elastic over time: once quality crosses a threshold, firms re‑architect workflows to exploit it, even if each call is pricier. jevons elasticity note If you’re running a serious AI product, the signal is simple: a lot of your peers are already experimenting with GPT‑5.2 at scale. The real question isn’t whether to try it, but which slices of your workload justify its higher per‑token and per‑request cost.

SimpleBench shows GPT‑5.2 underperforming older GPT‑5 and Claude Opus

On AI Explained’s SimpleBench—trick questions designed to test common‑sense reasoning and avoid obvious traps—GPT‑5.2’s scores are notably weak compared to both its predecessors and competitors. simplebench reaction GPT‑5.2 (base) lands at 45.8% and GPT‑5.2 Pro at 57.4%, versus GPT‑5 Pro at 61.6%, Claude Opus 4.5 at 62.0%, and Gemini 3 Pro at 76.4%.

Even more awkwardly for OpenAI, GPT‑5.2 base is only marginally ahead of older mid‑tier models like Claude 3.7 Sonnet (44.9%) and Claude 4 Sonnet (thinking) at 45.5%, and sits well behind its own GPT‑5 (high) configuration at 56.7%. (simplebench screenshot, simplebench recap) This aligns with early user sentiment that GPT‑5.2’s reasoning shines on long, structured tasks but doesn’t automatically dominate on short, trap‑heavy queries.

For engineers choosing default models, SimpleBench is a reminder that “latest and greatest” doesn’t always mean better on casual question‑answering—you may want to route tricky, bait‑style prompts to whatever you’ve empirically seen behave best rather than assuming GPT‑5.2 will outperform everything out of the box.

Epoch’s ECI suggests ~3.5‑hour METR time horizon for GPT‑5.2

Epoch plugged GPT‑5.2’s new capabilities score (ECI 152) into its regression against METR’s Time Horizons benchmark and predicts that a carefully run GPT‑5.2 should sustain competent, failure‑free behavior for about 3.5 hours on the hardest safety tasks. (eci time horizon thread, eci score thread)

In the same model, Gemini 3 Pro lands at a central estimate of 4.9 hours while Claude Opus 4.5 comes out at 2.6 hours, though Epoch stresses that the 90% prediction intervals are wide—roughly 2× up or down—and that their ECI framework historically underestimates Claude‑family models by about 30%. (eci caveat, eci benchmarks hub) So this doesn’t prove GPT‑5.2 is strictly “worse” than Gemini 3 Pro on Time Horizons, but it does support the qualitative picture: OpenAI’s model is now in the same long‑run league as the best models, not dramatically ahead.

For leaders thinking about multi‑hour agents or semi‑autonomous workflows, this suggests GPT‑5.2 is viable for long‑running tasks, but you should still budget guardrails, monitoring, and periodic human checkpoints rather than assuming it can be left alone indefinitely.

VendingBench‑2: GPT‑5.2 improves on GPT‑5.1 but trails Gemini 3 and Opus 4.5

On Andon Labs’ VendingBench‑2—an agent simulation where models manage a small “vending” business over ~350 days—GPT‑5.2 posts a big gain over GPT‑5.1 but still lags the top models. vendingbench summary The plotted equity curves show Gemini 3 Pro and Claude Opus 4.5 ending near the $4.5k–$5k mark, Claude Sonnet 4.5 around ~$3.8k, GPT‑5.2 around ~$3k, and GPT‑5.1 stuck near ~$1.2k.

So GPT‑5.2 looks substantially more competent than GPT‑5.1 at multi‑step, economically grounded decision‑making, but it’s not yet matching the strongest Gemini or Opus configurations on this particular long‑horizon benchmark. vendingbench summary For people building revenue‑impacting agents—pricing bots, growth agents, operations planners—this is a subtle but important data point: GPT‑5.2 is competitive, but not clearly dominant, once you move from academic math to money‑linked simulations.


🔎 Autonomous research agents: HLE/DeepSearchQA race

Continues yesterday’s Gemini Deep Research momentum but adds a new twist: Zoom claims HLE leadership with federated AI, while Google opens the Interactions API path for backgroundable Deep Research. Excludes GPT‑5.2 evals covered in the feature.

Gemini Deep Research gets Interactions API access and the DeepSearchQA benchmark

Google is turning its Gemini Deep Research agent into something engineers can call directly: the new Interactions API now supports a deep-research-pro-preview-12-2025 agent that runs long‑horizon web research jobs in the background. interactions code At the same time, the team has open‑sourced DeepSearchQA, a 900‑task benchmark for autonomous research across 32 domains, where Deep Research scores 66.1% vs. Gemini 3 Pro’s 56.6%. DeepResearch chart

Following up on initial launch, the code sample from Google’s devrel team shows a simple pattern: create an interaction with agent='deep-research-pro-preview-12-2025' and background=True, poll client.interactions.get() until status == "completed", then read interaction.outputs[-1].text. interactions code This makes Deep Research feel like a first‑class asynchronous job runner, not a one‑shot chat model—you kick off a multi‑minute research session, go do something else, and retrieve a synthesized report when it’s done.
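In Python, that pattern looks roughly like the sketch below, assuming the google‑genai client and the method names described in the linked sample; the exact create/get parameters (the prompt field, the interaction id) are guesses to verify against the real code.

```python
import time
from google import genai  # assumes the google-genai Python SDK used in the sample

client = genai.Client()

# Kick off a long-horizon research job in the background.
interaction = client.interactions.create(
    agent="deep-research-pro-preview-12-2025",
    background=True,
    input="Survey the last 12 months of work on end-of-turn detection for voice agents.",
)

# Poll until the agent finishes, then read the synthesized report.
while True:
    interaction = client.interactions.get(interaction.id)  # exact getter signature may differ
    if interaction.status == "completed":
        break
    time.sleep(30)

print(interaction.outputs[-1].text)
```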

On the evaluation side, Google is treating DeepSearchQA as the long‑form counterpart to benchmarks like HLE. The released card shows Gemini Deep Research at 66.1% vs. Gemini 3 Pro at 56.6% and o4‑mini/o3‑deep‑research baselines in the 40s on DeepSearchQA, plus strong scores on Humanity’s Last Exam (46.4%) and BrowseComp (59.2%). DeepResearch chart Unlike a simple QA leaderboard, tasks force the agent to read PDFs, CSVs, and web pages, then synthesize multi‑source answers—exactly the kind of workflow the Interactions API is built to host.

For you as a builder, the interesting part is the alignment between API surface and benchmark. You now have:

  • A production‑oriented agent endpoint that can run for minutes in the background with clear status semantics. interactions code
  • A public, structured benchmark (DeepSearchQA) that mirrors those capabilities and gives you a yardstick for your own research agents. DeepResearch chart

So what? If you’re experimenting with autonomous literature review, due‑diligence bots, or compliance research, you can start by:

  • Prototyping against the Interactions API using Deep Research as a strong baseline agent.
  • Running your own harness on DeepSearchQA tasks to see if custom prompting, tools, or routing can beat Google’s default agent.
  • Treating Deep Research less as “the smart model” and more as one component in a federated or multi‑agent system, especially now that Zoom has shown how orchestration can shuffle the leaderboard. Zoom HLE blog

This feels like the start of a more formal agent race: Google is giving you both the agent (via Interactions) and the testbed (via DeepSearchQA), while other players like Zoom are proving that you can swap in your own pipelines and still compete at the top of Humanity’s Last Exam. builder confusion Expect more teams to plug into these APIs, run their own federated setups, and publish results that blur the line between “model benchmark” and “agent benchmark.”

Zoom’s federated AI edges Gemini Deep Research on Humanity’s Last Exam

Zoom is claiming the new top spot on Humanity’s Last Exam (HLE), reporting a 48.1% full-set score for its “federated AI” agent, beating Gemini 3 Pro’s 45.8% and GPT‑5 Pro’s 38.9% in the same table. Zoom HLE blog This is the first time a collaboration-heavy, multi‑model agent stack has publicly leapfrogged a frontier lab’s own research agent on this benchmark.

Zoom describes a three‑stage loop—explore, verify, federate—where multiple models search, propose answers, cross‑check against full context, and then a separate “Z‑scorer” picks or combines outputs. Zoom HLE blog In other words, the win is less about a single base LLM and more about orchestration: routing different question types to different specialists, running redundant attempts, and using verification passes to knock out bad candidates.

Builders are understandably puzzled: HLE leadership just shifted from Google’s Gemini Deep Research to a conferencing company, and Zoom hasn’t disclosed which underlying models it uses. builder confusion That raises all the usual questions about apples‑to‑apples comparisons—how much of the gain is agent design vs. base model choice vs. search tooling—and whether federated stacks should live in the same leaderboard row as single‑model agents.

For AI engineers and analysts, the takeaway is that agent architecture now matters as much as model choice on complex research tasks. If you’re building your own HLE‑style agents, this result argues for:

  • Routing: different models or prompts for lookup, reasoning, and answer synthesis.
  • Verification: explicit cross‑checks on each candidate answer against the full context, not just “best of N sampling”.
  • Federation: a cheap arbiter model or scoring function (like Zoom’s Z‑scorer) that treats upstream models as interchangeable workers rather than oracles.

The point is: Zoom just demonstrated that a well‑engineered federated agent can outrun lab‑native systems on one of the toughest public research benchmarks. That doesn’t settle which base model to bet on, but it does signal that the next meaningful gains in autonomous research may come from how you wire models together, not only from waiting for the next frontier release.
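As a rough illustration of that routing/verification/federation pattern (not Zoom’s actual pipeline, whose underlying models and Z‑scorer remain undisclosed), a minimal federated answerer might look like:

```python
# Toy sketch of an explore -> verify -> federate loop.
# `ask` is any (model, question) -> answer callable; `verify` cross-checks an answer
# against the full context; `score` is a cheap arbiter that rates support for an answer.
def federated_answer(question, context, workers, ask, verify, score, attempts=2):
    candidates = []
    # Explore: several models, several redundant attempts each.
    for model in workers:
        for _ in range(attempts):
            candidates.append((model, ask(model, question)))

    # Verify: drop candidates that fail an explicit cross-check against the context.
    verified = [(m, a) for m, a in candidates if verify(a, context)]
    if not verified:
        verified = candidates  # fall back rather than returning nothing

    # Federate: let the arbiter pick the best remaining answer.
    return max(verified, key=lambda pair: score(pair[1], question, context))[1]
```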


🧰 Coding agents & skills: ChatGPT/Codex skills, IDE QoL, ops

Hands‑on upgrades for builders: ChatGPT’s /home/oai/skills for PDFs/Docs/Sheets, Codex CLI experimental skills, Cursor’s visual editor fixes, Oracle adding GPT‑5.2 Pro extended thinking. Excludes GPT‑5.2 benchmark debates covered in the feature.

OpenAI quietly rolls out reusable “skills” for ChatGPT and Codex

OpenAI’s code sandbox now ships with a /home/oai/skills folder and Codex CLI has an experimental ~/.codex/skills mechanism, giving builders a first‑party way to package multi‑file tools for spreadsheets, DOCX and PDFs as reusable “skills.” skills folder demo skills writeup

Each skill is a directory with a Markdown spec and optional scripts; the CLI discovers them at startup and the prompting layer uses a “progressive disclosure” contract (read SKILL.md, then only load extra files when needed) as shown in the extracted system prompts. (cli skills docs, cli skills guide, prompt gist)

Simon Willison’s deep dive also uncovers the PDF skill’s pipeline: PDFs are rendered to per‑page PNGs, then handed to a vision model so ChatGPT can reason about layout, tables and images instead of doing brittle text extraction. (skills deep dive, skills blog post) Developers are already abusing this—one example had ChatGPT assemble a multi‑page report on New Zealand’s kākāpō breeding season entirely inside the sandbox using the PDF skill. (pdf skill example, pdf viewer)

For AI engineers, this is effectively a supported, file‑system‑based tool plug‑in format: you can ship well‑tested code plus docs alongside your agent, keep prompts small, and rely on OpenAI’s own runtime to turn those skills into tool calls, instead of reinventing a bespoke tools layer on every project.
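A minimal sketch of the progressive‑disclosure idea, assuming the ~/.codex/skills layout described above; the indexing and loading functions here are invented for illustration, not OpenAI’s implementation:

```python
from pathlib import Path

SKILLS_DIR = Path.home() / ".codex" / "skills"   # or /home/oai/skills in the ChatGPT sandbox

def index_skills():
    """Cheap startup pass: read only each skill's SKILL.md so prompts stay small."""
    skills = {}
    for skill_dir in SKILLS_DIR.iterdir():
        spec = skill_dir / "SKILL.md"
        if skill_dir.is_dir() and spec.exists():
            skills[skill_dir.name] = spec.read_text()
    return skills

def load_skill_files(name):
    """Progressive disclosure: pull in the rest of a skill's files only when it is invoked."""
    skill_dir = SKILLS_DIR / name
    return {p.name: p.read_text() for p in skill_dir.rglob("*") if p.is_file()}
```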

Claude Code adds Android client, async runs, and desktop local files

Anthropic’s Claude Code now spans more surfaces: there’s an Android app in research preview for cloud‑hosted coding sessions, agents can run asynchronously, /resume groups forked sessions, MAX users can hand out 1‑week Pro guest passes, and the desktop app just gained support for working on local files without dropping to the CLI. (Claude CLI, weekly roundup, local files comment)

On Android, you can kick off long Claude Code tasks from your phone, let them run in the cloud, and review or resume them later—handy for tests or refactors that don’t need a laptop in front of you. weekly roundup Guest passes (/passes) make it easier to get teammates onto Claude Code Pro without running a full rollout, and async agents help with “fire‑and‑forget” background exploration of large repos. guest pass screenshot The desktop local‑files support closes a big ergonomics gap: you can now point Claude Code at your local project directly from the GUI and still get the same multi‑agent workflows, rather than juggling editor, CLI, and browser. desktop cli view Taken together, Claude Code is looking less like a fancy CLI and more like a full dev‑environment companion: always‑on, multi‑device, with a path to share access and keep code close to where developers actually work.

CopilotKit’s `useAgent()` turns any React app into an agent console

CopilotKit v1.50 introduces a useAgent() hook that lets you plug any LangChain/LangGraph, Mastra or PydanticAI agent directly into a React frontend and let it drive chat UI, generative interfaces, frontend tool calls, and shared state. (copilotkit v1-50, pydantic useagent)

Video: useAgent hook code demo

The hook wraps an agent endpoint and exposes a simple interface (const { agent } = useAgent({ ... })) so your UI can send messages, stream responses, and let agents take actions (like mutating app state or calling tools) without custom wiring each time. pydantic useagent Separate adapters make it easy to talk to LangGraph agents, and other tweets show people wiring up back‑and‑forth multi‑agent chats (e.g. Gemini vs GPT arguing about pizza) with the same primitive. (langgraph adapter, multi-agent pizza demo) There’s even an early integration with VoltAgent, where useAgent acts as the bridge between their agent orchestration and a React front‑end. voltagent comment If you’re building an agentic product, this gives you a higher‑level “agent socket” for the UI layer: instead of reinventing chat plumbing and event handling for each experiment, you can standardize on useAgent() and swap in different backends as your orchestration stack evolves.

Cursor rapidly iterates on its new visual editor based on feedback

Cursor shipped a wave of quality‑of‑life fixes to its brand‑new visual browser editor—removing animations, adding ⌘Z/⌘⇧Z undo/redo, making backspace delete the current element, rounding blur to 0.1, and letting Select add multiple elements as context. (visual editor, update list, rollout note)

Video: Cursor visual edits with undo

These are all about feel: selection now snaps instantly instead of easing, keyboard shortcuts work like designers expect, and multi‑select means you can hand the agent a coherent set of DOM nodes to refactor instead of a single div at a time. (undo shortcut demo, multi-select demo) The team is shipping changes literally the day after launch, explicitly calling out users by handle in thank‑you notes, which tells you this editor is core to their agent story, not a side experiment. rollout note If you’re already using Cursor’s browser to refactor UI, this makes the visual editor much less toy‑like: you can treat it more like Figma‑meets‑DOM for quick layout tweaks, while Cursor keeps generating the underlying React/Tailwind code.

Oracle CLI adds GPT‑5.2 Pro with extended thinking and sturdier uploads

The Oracle 🧿 CLI for headless browser automation now supports gpt-5.2-pro, including the new “Extended thinking” mode in its model switcher UI, and it hardens file uploads by pasting contents inline up to ~60k chars and automatically retrying when ChatGPT rejects them. (oracle 0-6 release, oracle release notes) Extended thinking lets Oracle push 5.2 Pro into the long‑horizon regime—multi‑step browsing or codegen runs that reason for many minutes before committing actions—without you hand‑tuning reasoning settings for each call. long thinking example The upload changes matter if you’re driving ChatGPT’s browser UI from Oracle: instead of a single failed drop killing the run, Oracle now degrades gracefully by inlining smaller files and retrying larger ones, which is exactly what you want for flaky web frontends.

For teams using Oracle as an agent harness, this is a clean way to get access to GPT‑5.2 Pro’s xhigh‑style reasoning inside scripted browser workflows, while reducing two of the most common sources of flakiness: fragile uploads and under‑ or over‑thinking model calls.

Acontext launches as a context and skill memory layer for agents

Acontext is positioning itself as a “context data platform” for cloud‑native AI agents, storing session traces, artifacts and distilled reusable skills so agents can actually learn from past work instead of repeating brute‑force exploration. acontext overview

The core idea: log each agent session, break it into tasks with outcomes, then automatically distill successful patterns into named skills (like deploy_sops or github_ops) that future runs can invoke, all visible in a dashboard that mirrors a Session → Tasks → Skills pipeline. acontext overview That moves a lot of “agent memory” out of brittle prompt soup and into a structured store, with enough metadata to debug why certain workflows work or fail.

If you’re experimenting with long‑running coding or ops agents, Acontext looks like the missing state layer: it can host per‑org context, track user feedback, and surface which chains of tools and prompts are actually reliable enough to promote into first‑class skills for your orchestrator.
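The Session → Tasks → Skills pipeline is easiest to picture as a small data model; the field names below are illustrative guesses, not Acontext’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    description: str
    tool_calls: list[str]
    outcome: str              # e.g. "success" or "failure"

@dataclass
class Session:
    session_id: str
    tasks: list[Task] = field(default_factory=list)

@dataclass
class Skill:
    name: str                 # e.g. "deploy_sops", "github_ops"
    steps: list[str]          # distilled, reusable tool/prompt sequence
    source_sessions: list[str] = field(default_factory=list)

def distill_skills(sessions: list[Session]) -> list[Skill]:
    """Promote tool sequences that repeatedly succeeded into named, reusable skills."""
    ...
```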

Anthropic memory tool lands in Vercel AI SDK for persistent agent context

Vercel’s AI SDK has added support for Anthropic’s memory_20250818 tool, letting agents persist, view, and edit long‑term memory via a structured tool interface instead of shoving everything into prompts. memory tool code

In the example ToolLoopAgent config, memory is just another tool, but its execute handler handles commands like view, create, str_replace, insert, delete, and rename over a backing store such as a file or database. memory tool code That means you can give Claude‑powered agents a durable scratchpad or knowledge base—say, per‑user preferences or project notes—while keeping the actual persistence logic on your side of the wire.

For anyone building multi‑session coding assistants or project copilots, this is a clean pattern: treat memory as a first‑class tool with a narrow API, not an implicit blob in system prompts, and let the SDK orchestrate when the model is allowed to read or modify it.

Helicone adds one‑line DSPy integration for agent observability

Helicone shipped a direct integration with DSPy so you can trace and inspect DSPy‑built agents with a single config change, bringing its logging and analytics to DSPy’s programmatic prompting workflows. helicone dspy integration DSPy is increasingly used to structure multi‑step reasoning and retrieval pipelines; having Helicone capture each call, latency, cost and outcome makes it much easier to debug when your agent trees blow up or silently degrade.
The key point is simplicity: the Helicone team emphasizes that adding their client to a DSPy project is a “one line change,” which lowers the bar enough that even experimental agent runs can be instrumented from day one.

If you’re leaning into DSPy for complex coding or research agents, pairing it with Helicone gives you a quick path to see where prompts are thrashing, how different models behave under the same policy, and which parts of a pipeline are actually worth optimizing.


🎙️ Realtime voice: Gemini Live, Translate, and call UX

Voice stacks level up: Gemini 2.5 Flash Native Audio improves tool calling and instruction adherence; Translate adds live speech‑to‑speech with tone and cadence; LiveKit cuts false end‑of‑turns. Mostly platform and latency/form‑factor updates.

Gemini 2.5 Flash Native Audio tightens tool calls and multi-turn voice

Google rolled out an updated Gemini 2.5 Flash Native Audio model with much stronger function calling and instruction adherence for live voice agents, hitting 71.5% on ComplexFuncBench audio, 90% dev-instruction adherence, and 83% overall conversational quality. audio launch note This builds on the TTS refinements from earlier in the week tts update, and it’s already exposed via the Live API and Vertex as gemini-2.5-flash-native-audio. pipecat api mention

For builders, the point is that the native audio stack now decides more reliably when to call tools and when to stay in conversation, which matters for real customer calls where spurious tool calls burn latency and money. The benchmark chart shows a clear step up versus the September Flash Native Audio release and competitive parity or better versus OpenAI’s realtime stack on function-calling accuracy. bench comparison Full details on the audio roadmap sit in Google’s latest Gemini audio blog. google blog post

Google Translate adds Gemini-powered live speech-to-speech with headphones

Google Translate is getting Gemini-powered live speech‑to‑speech translation plus a new “Live translate” headphone beta that preserves tone, pacing, and pitch across more than 70 languages and ~2,000 language pairs. translate summary You can either stream one-way translation to headphones or run a two-way mode that switches target language by speaker, starting on Android in the US, Mexico, and India, with iOS and more regions due in 2026. rollout details

Video: live translation demo

Under the hood, Translate now leans on Gemini 2.5 Flash Native Audio to resolve context and idioms so it avoids literal, awkward phrasing; the video demo shows near‑instant English/Spanish captions with natural prosody. demo praise The same blog clarifies that translation is noise‑robust and auto‑detects languages, and it frames this alongside Translate’s upgraded practice mode that adds streaks and more tailored feedback. translate blog post For voice UX teams, this is essentially turning any headphones into a low-friction, near‑real‑time interpreter, following up on earlier pricing work for Gemini Live two‑way voice. price point

LiveKit’s new end-of-turn model cuts false interruptions by 39%

LiveKit shipped an updated end‑of‑turn detection model for its voice stack that reduces errors by an average of 39% across 14 languages, with special handling for structured data like phone numbers, emails, and URLs. model announcement That means voice agents are less likely to cut users off mid‑sentence or hang awkwardly when someone is slowly dictating a number.

Video: end-of-turn demo

The team explains in their blog how they trained the detector on multi‑language conversational data and tuned it to avoid treating brief pauses inside email addresses or numeric sequences as true turn endings. livekit blog post For anyone running Gemini, OpenAI, or custom LLMs behind LiveKit, this is a drop‑in improvement: you keep your model prompts the same, but users experience fewer "hello? are you still there?" moments and more natural overlaps in live calls.

Pipecat exposes new Gemini Live audio models for browser voice agents

Pipecat updated its demo to use the new Gemini Live model gemini-2.5-flash-native-audio-preview-12-2025 from Google AI Studio and the GA gemini-live-2.5-flash-native-audio on Vertex AI, giving developers a quick way to try the upgraded live voice stack in a browser. pipecat demo

The landing page now ships a Gemini Live preset that shows streaming waveforms, partial transcription, and back‑and‑forth audio turns, so you can feel the lower latency and smoother turn‑taking before wiring it into your own stack. pipecat demo This ties the new Native Audio metrics to a concrete dev surface: instead of wiring the Live API from scratch, you can prototype an agent pipeline in Pipecat, then later swap in your own backends while keeping the same event model. It’s a good reference if you’re trying to understand how to plumb Gemini Live into WebRTC or Daily-style infra and want to see working code rather than docs alone.

MiniMax voices arrive on Retell AI with <250 ms latency and 40+ languages

Retell AI added MiniMax voices to its telephony stack, advertising sub‑250 ms latency for real‑time conversations, smart text normalization, and support for 40+ languages with inline code‑switching. minimax retell launch This is squarely aimed at production call centers and conversational IVRs where every extra 100 ms of delay makes the agent feel synthetic.

Retell’s announcement highlights that the TTS layer now understands URLs, emails, dates, and numbers, so callers hear cleaned‑up speech instead of raw tokens, and that MiniMax’s multilingual support can switch languages fluidly inside a single turn. minimax retell launch For teams building voice agents on Retell, this gives a new fast‑path for non‑English markets without having to manage your own low‑latency TTS servers or pronunciation dictionaries.

OpenBMB’s VoxCPM offers tokenizer-free TTS at 0.17× real-time

OpenBMB released VoxCPM, a tokenizer‑free text‑to‑speech model that runs at real‑time factors as low as 0.17 on consumer GPUs, meaning it can generate speech several times faster than playback speed while keeping high fidelity. voxcpm thread Instead of discrete codes, it uses a hierarchical continuous setup (text‑semantic language model + residual acoustic language model) to keep both stability and expressiveness. technical report

The 0.5B‑parameter model is Apache‑2.0 licensed and ships with a Hugging Face model card plus demo space, so it’s easy to drop into an existing stack for offline or near‑realtime synthesis. huggingface model card For voice agent builders, VoxCPM is interesting as a local fallback or on‑device voice for latency‑sensitive flows where a cloud TTS round‑trip is too slow, or where you want more control over the full waveform generation process than typical codec‑based models allow.


🎬 Generative video & visuals: actor swaps, Kling/Kandinsky, playbooks

High‑volume creative news: InVideo ‘Performances’ preserves acting while swapping cast/scene, Leonardo + Kling 2.6 workflows, and Kandinsky 5.0 entries on Video Arena. Ensures creative stack coverage distinct from model evals.

Invideo launches Performances for high-fidelity cast and scene swaps

Invideo is rolling out a new Performances engine that lets you swap actors or entire scenes while preserving the original performance – eye lines, lip sync, micro‑expressions, timing – from a single iPhone take, following up on ai film tool that focused on stylized, performance‑preserving looks. feature breakdown Creators can now do Cast Swap (new character over the same acting) or Scene Swap (change character plus environment) and turn one raw take into up to 100 polished variants without reshoots. actor swap demo

Video: performances demo clip

For UGC agencies and performance marketers, this effectively turns a laptop into a lightweight virtual production stage: one script and session can become hundreds of consistent ads with frame‑accurate acting, while swapping brands, outfits or backdrops as needed. marketing use cases A detailed how‑to from the vendor walks through using Performances for both cast and full‑scene replacement, including guardrails to keep emotion and motion locked to the original take. feature guide The point is: this is one of the first mass‑market tools where actor swaps feel like controlled post‑production, not a glitchy face filter, which changes how smaller teams can think about re‑use of on‑camera talent.

Leonardo + Kling 2.6 workflow promises cheap AI game cutscenes

A new guide from Leonardo’s community shows how to build game cutscenes by chaining Nano Banana Pro for stills with Kling 2.6 for motion, arguing that studios will adopt AI video for cutscenes because it is faster and cheaper than traditional pipelines. workflow thread The recipe is straightforward: grab an in‑engine screenshot, feed it to Nano Banana Pro with a cinematic prompt to get a high‑quality still, iterate a few times, then hand that still to Kling 2.6 to generate animated shots with native audio.

Video: cutscene pipeline demo

Because Kling 2.6 can be steered shot by shot and already includes synced audio, small teams can prototype entire sequences without mocap or voice sessions, then selectively upscale the best takes. workflow thread The same flow works beyond games – any static key art or product render can become a short, on‑model video sequence – but the thread explicitly argues that the "video game industry will start using AI to create its cutscenes" and provides prompts and settings for people to copy into Leonardo today. workflow guide For engineers, the takeaway is that image and video models are starting to look like a single composable stack: one image model for look, one video model for motion and sound, tied together by prompt conventions rather than custom tools.

Video Arena ranks Kling 2.6 Pro and Kandinsky 5.0 among top video models

Video Arena’s latest leaderboard update shows Kling 2.6 Pro and Kandinsky 5.0 entering as serious contenders for text‑to‑video and image‑to‑video work, adding third‑party signal around models that were mostly seen through vendor promos before. leaderboard thread Kling 2.6 Pro now sits at #10 overall for text‑to‑video with a score of 1238 and #6 for image‑to‑video with 1296, reportedly a 16‑point jump over Kling‑2.5‑turbo‑1080p, while Kandinsky‑5.0‑t2v‑pro debuts as the #1 open‑weight text‑to‑video model (#14 overall, score 1205) and the lighter Kandinsky‑5.0‑t2v‑lite lands #3 open (#22 overall, score 1121). shots launch

For creative teams deciding what to actually wire into tools, this means there is now crowd‑sourced preference data saying Kling’s latest release and Kandinsky 5.0 are competitive not just on single demos but across many prompts and judges. The update also tightens the race among open models: Kandinsky 5.0 now anchors a realistic open‑source option for video, while Kling 2.6’s gains hint that proprietary stacks will keep iterating fast; both matter if you are choosing a default backend for a video editor, ad‑builder or game tool rather than hand‑picking a model per project.

ComfyUI showcases 3×3 Nano Banana Pro grid for product ads

ComfyUI is highlighting a new "3×3 Grid For Product Ads" workflow that uses Nano Banana Pro to generate nine distinct ad shots from a single image prompt, then lets you pick and upscale a favourite shot inside the node graph. event announcement The live deep‑dive pairs community creators with ComfyUI’s team to walk through a template where one Nano Banana Pro call lays down style and subject, and the grid node fans that out into a board of variations tuned for things like angle, framing and background treatment.

Because it is implemented as a public template rather than a closed tool, anyone running ComfyUI can drop it into their own pipeline for ecommerce, app screenshots or social promos, and adjust how aggressive the variation is versus brand consistency. workflow link People close to the project note that this is effectively an ad‑shot generator for small shops: instead of hand‑prompting dozens of product angles, you route a single base image through the grid, get nine options, and only invest more compute in the one or two that are worth upscaling. community reaction It is another example of how open‑graph tools are turning "prompt engineering" patterns (like multi‑shot boards) into reusable building blocks that designers can treat as presets rather than fragile one‑off hacks.


🏗️ AI infra economics: GPUs, DC timelines, and debt risk

Infra threads center on supply and financing: Nvidia may boost H200 output for China amid tariffs/capacity limits, Oracle slows DC buildout, and Bloomberg flags $10T data‑center boom risks. Also a TPU design note from Jeff Dean on speculative features.

AI data center boom heads toward $10T with rising debt and glut risk

Bloomberg estimates that the total cost of the AI‑driven data center build‑out could reach around $10 trillion globally, and counts roughly $175 billion of US data center credit deals in 2025 alone. bloomberg dc bubble Another dataset shows JPMorgan as the single largest project‑level lender at close to $6 billion in AI data center loans, with a long tail of banks and private credit funds following. dc lending chart

Following up on Broadcom’s report of a $73 billion networking backlog for AI data centers ai backlog, this wave is now funded with increasingly aggressive terms: some borrowers are seeking loans worth 150% of build cost at yields barely 1 percentage point over Treasuries. Deals are being securitized or structured as synthetic leases, which hides risk on balance sheets and echoes late‑cycle real‑estate patterns. For AI companies, this can keep GPU hosting cheap in the short run, but infra and finance teams should model scenarios where oversupply, falling rack rents, or a credit squeeze force consolidations, cancelled expansions, or sudden price hikes on long‑term training and inference contracts.

China mulls extra $70B in chip incentives as global AI race heats up

Bloomberg sources say China is weighing as much as $70 billion in new semiconductor incentives to accelerate domestic fabs and GPU makers, on top of existing programs like the “Big Fund” and earlier stimulus. china chip plan A separate investment breakdown shows China already leads planned and allocated chip support at about $142 billion, versus ~$75 billion in the US (CHIPS Act plus loans and tax breaks) and smaller but rising efforts in South Korea, Japan, and the EU. global chips chart

For AI builders this isn’t abstract industrial policy. If even part of this money flows into advanced nodes and accelerator design, it shortens the time until Chinese GPU and NPU lines can stand in for Nvidia in local training clusters, and reduces the bite of US export controls in the medium term. It also sharpens the contrast with Europe’s more regulatory focus: a world where China pours >$200 billion into fabs while Brussels waits for an “AI bubble” to pop will tilt where model training, inference hosting, and even open‑source model innovation physically live.

Nvidia weighs boosting H200 GPU output for China despite export fees

Reuters reports that Nvidia is considering increasing production of its H200 GPUs after Chinese orders outstripped what it can ship in December, even under new US rules that add a 25% fee on China sales and keep the chips on TSMC’s scarce 4 nm lines. china h200 report H200 is estimated to deliver around 6× the performance of Nvidia’s China‑specific H20, and Chinese regulators are reportedly weighing requirements that each imported H200 be bundled with a certain ratio of domestic accelerators.

For infra teams this means China’s access to high‑end Nvidia silicon may grow again, despite earlier quotas steering state buyers toward Huawei and Cambricon GPUs huawei shift. If Nvidia does ramp H200, it will be competing with Blackwell and future Rubin parts for the same TSMC capacity, so non‑Chinese buyers could still see tight supply and pricing pressure in 2026–27. The point is: export controls are becoming a tax and routing problem, not a clean cutoff, and capacity allocation decisions at TSMC now directly shape where frontier‑scale AI training happens on the planet.

Oracle pushes some OpenAI-linked data center timelines from 2027 to 2028

AILeaks notes that Oracle has delayed parts of its data center build‑out for OpenAI from 2027 into 2028, citing labor and materials shortages on the projects that underpin future ChatGPT capacity. oracle dc delay These sites are part of Oracle’s broader pitch as a primary cloud for OpenAI training and inference workloads.

For infra leaders the message is that even with huge demand and capital lined up, physical execution is the bottleneck. Power, crews, and construction supply chains move slower than GPU allocation spreadsheets. If you’re planning to lean on these Oracle‑hosted clusters in the late‑2020s, it’s worth treating dates as soft and building contingency plans—regional diversity, alternative clouds, or more aggressive on‑prem builds—so model launches and enterprise rollouts aren’t gated by someone else’s construction schedule.

Jeff Dean explains TPU strategy of reserving die area for speculative features

Jeff Dean says it’s nearly impossible to predict ML hardware needs 2–6 years out, so Google’s TPU roadmap deliberately embeds researchers early and allocates part of each chip’s die area to speculative “maybe” features. jeff dean clip If a new training pattern or architectural idea works, the hardware is ready on day one; if it doesn’t, only a small slice of silicon is wasted.

Video: jeff dean tpu remarks

This is a very different posture from traditional CPU roadmaps optimized around stable, known workloads. For people designing accelerators or negotiating long‑term GPU/TPU commitments, the takeaway is that a chunk of next‑gen tensor cores, interconnects, or quantization paths will be experimental by design. That increases upside—faster adoption of things like sparsity, attention variants, or RL‑heavy training—but also means some features your teams bet on may never be exercised in production. Infra leads should keep tight loops between research and platform teams so they can actually exploit the “maybes” they’re already paying for in chip area.


💼 Enterprise adoption & market share: BBVA, Menlo data, Go tier

What changed today: OpenAI expands ChatGPT Enterprise at BBVA (120k seats), Menlo Ventures shows Anthropic at 40% enterprise LLM spend vs OpenAI 27%/Google 21%, and ChatGPT Go expands in LATAM. Excludes feature’s 5.2 eval storyline.

BBVA rolls ChatGPT Enterprise out to 120,000 employees

OpenAI and Spanish bank BBVA are expanding their deployment of ChatGPT Enterprise to 120,000 staff, framing it as a shift toward "AI‑native" banking workflows rather than a small pilot tool. bbva announcement The joint blog post says the rollout will support internal productivity use cases and lay groundwork for future agentic workflows across retail and corporate banking, underlining that large, regulated incumbents are now comfortable standardizing on frontier LLMs for day‑to‑day knowledge work. OpenAI blog This matters for builders because BBVA is a G-SIB‑scale reference customer: it validates ChatGPT Enterprise’s security posture for financial data and signals that banks may soon expect deep integrations (tickets, docs, CRM, RPA) rather than isolated chatbots. If you’re selling into fintech or FS, this is a strong proof point that "generic" LLMs can clear compliance and be rolled out at full workforce scale, not just to innovation labs.

Menlo: Anthropic jumps to 40% of enterprise LLM spend, OpenAI falls to 27%

Menlo Ventures’ 2025 State of GenAI report shows a sharp reshuffle in enterprise LLM API spend: Anthropic now captures 40%, up from 24% last year and 12% in 2023, while OpenAI has slid to 27% from 50% in 2023; Google has climbed to 21% from 7% over the same period. menlo market share In coding‑focused workloads Anthropic is even more dominant, with 54% 2025 share vs 21% for OpenAI and 11% for Google.

The same dataset shows open‑source providers (Meta, Mistral, Qwen, DeepSeek, others) shrinking from 19% to 11% of enterprise API usage as Llama’s slow release cadence and fragmented offerings blunt its earlier momentum. open source decline


For engineering leaders, the takeaway is that Anthropic and Google have converted their model quality and pricing into real budget share, while open models are still more of an infra and cost‑control play than the primary API for most large enterprises. If you’re building multi‑provider routing, these numbers argue for treating Anthropic as a first‑class default alongside (not behind) OpenAI in 2025.

ChatGPT Go expands in LATAM as a cheaper GPT-5 access tier

Following up on Go launch where OpenAI partnered with Rappi to push ChatGPT Go in Latin America, the company is now more broadly surfacing Go in the ChatGPT app across LATAM markets as a low‑cost middle tier between Free and Plus. go plan screenshot A Spanish UI screenshot shows Go including GPT‑5 access for free users but adding higher message caps, more file uploads, more image generation, advanced data analysis, and expanded memory compared to the free plan.

This effectively turns Go into a mass‑market, region‑sensitive way to monetize GPT‑5 usage without the full Plus price, and it will likely push a lot of casual users and small businesses onto a predictable subscription instead of pure free usage. If you build around ChatGPT as a frontend for your product in LATAM, expect more of your customers to have Go‑level limits (more files, more images) even if they’re not on full Plus, and consider tailoring flows to work within those expanded but still finite quotas.

Pinterest says open models cut its AI costs by ~90%

Pinterest CEO Bill Ready told The Information that the company has reduced its AI model costs by about 90% by switching heavily to open‑source LLMs instead of “brand‑name” proprietary APIs. pinterest quote


Clement Delangue from Hugging Face amplified the quote as evidence that we may be at the "peak of proprietary APIs" for many workloads. pinterest quote For AI platform teams this is a concrete datapoint that production recommender/search/ads systems can move to self‑hosted or managed open models without killing quality, and that CFOs now expect order‑of‑magnitude savings from such moves. It increases pressure on closed‑model vendors to justify higher per‑token pricing with differentiated capabilities (long‑context, safety, tools) or enterprise features, and it strengthens the business case for investing in your own fine‑tuned open‑model stack rather than assuming API costs will always be acceptable.

Lightfield pitches AI-native CRM built around transcripts, not fields

New startup Lightfield is positioning itself as an "AI‑native" CRM that treats emails, calls, and meetings as the primary source of truth and lets an LLM backfill and maintain structured fields like deal stage, last touch, objections, and budget. lightfield overview Instead of reps typing into rigid forms, the system continuously ingests conversations, runs extraction over the whole history, and then proposes CRM updates for humans to accept, including retroactively populating new fields across months of past data.

Video: lightfield workflow demo

The company claims a 20k+ waitlist and says hundreds of YC founders have already switched from HubSpot/Attio/Notion‑style stacks, betting that AI‑first CRMs will win once they can keep structure in sync with messy text at scale. lightfield overview For AI engineers this is a good reference pattern: center the unstructured corpus, derive schema on demand, and keep humans in the loop approving model‑suggested changes, rather than bolting a chatbot onto an old SQL‑first backend.
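A generic sketch of that pattern (not Lightfield’s implementation): define the CRM schema, have the model propose field updates grounded in the transcript corpus, and surface only the diffs for human approval:

```python
import json

DEAL_FIELDS = ["deal_stage", "last_touch", "objections", "budget"]

def propose_crm_updates(llm, transcripts: list[str], current: dict) -> dict:
    """Ask the model for field updates grounded in the conversation history; humans approve later."""
    prompt = (
        "From the conversation history below, propose updated values for these CRM fields: "
        f"{DEAL_FIELDS}. Return JSON mapping field -> value, and omit fields you are unsure about.\n\n"
        + "\n---\n".join(transcripts)
    )
    proposed = json.loads(llm(prompt))                     # llm: any text-in, text-out callable
    # Only surface actual changes, so reviewers see a short diff instead of a full record.
    return {k: v for k, v in proposed.items() if current.get(k) != v}
```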


🧠 Reasoning training: targeted credit, longer RL runs, async thinking

New research‑style advances emphasize efficiency: hierarchy‑aware credit assignment for planning tokens, extended open RL runs (Olmo 3.1), and async reasoning to keep speaking while thinking. Practical for scaling ‘agents that reason’.

Asynchronous Reasoning paper turns “thinking LLMs” into real‑time agents

The Asynchronous Reasoning paper proposes a training‑free way to let "thinking" LLMs keep reasoning while they are already talking, cutting end‑to‑end response delays by 6–11× while keeping the same plan quality. paper summary Instead of generating all private chain‑of‑thought then answering, the method runs three token streams in one forward pass—user input, hidden thoughts, and public reply—by remapping rotary positions in the KV cache so the model sees one coherent timeline.

On interactive benchmarks, the system gets first spoken or visible output in ≤5 seconds while preserving the benefits of long CoT, and a safety variant that adds a "safety thinker" stream cuts harmful jailbreak answers from 13%→2% on a red‑teaming set. paper summary For anyone building voice agents, copilots, or robot controllers on top of slow "Thinking" models, this is a big deal: it suggests you can retrofit existing models with async planning using only inference‑time cache tricks, without retraining or architecting a whole separate fast/slow model pair.

HICRA RL paper shows planning‑token credit boosts LLM reasoning

A new RL paper proposes Hierarchy‑Aware Credit Assignment (HICRA), which concentrates reward and gradient signal on planning tokens instead of treating every token equally, and it delivers sizable gains on math and Olympiad benchmarks versus GRPO and similar baselines. paper thread The authors identify "Strategic Grams" like “but the problem mentions that…” as scaffolding for reasoning, then up‑weight those n‑grams during RL, pushing Qwen3‑4B‑Instruct from 68.5%→73.1% on AIME24 and Qwen2.5‑7B‑Base up +8.4 points on AMC23 and +4.0 on Olympiad tasks according to the OpenReview draft. ArXiv paper

For builders, the point is: if you’re training reasoning models, most of your RL signal is currently wasted on low‑level execution tokens; this work offers a practical recipe (token classification + reweighted loss) to get more out of the same data and steps. It also gives a more mechanistic story for "aha moments" and length scaling under RL, which helps explain why longer chains sometimes emerge abruptly rather than smoothly. rl-commentary
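A simplified PyTorch‑style sketch of that recipe; this is an illustration of the idea, not the authors’ code, and the planning‑token classifier and boost factor are placeholders:

```python
import torch

PLANNING_BOOST = 2.0   # illustrative up-weight for planning tokens

def planning_token_mask(token_ids, strategic_gram_spans):
    """Mark tokens inside 'Strategic Gram' spans (e.g. 'but the problem mentions that...')."""
    mask = torch.zeros_like(token_ids, dtype=torch.float)
    for start, end in strategic_gram_spans:   # precomputed (start, end) indices in the sequence
        mask[..., start:end] = 1.0
    return mask

def hicra_style_loss(logprobs, advantages, token_ids, strategic_gram_spans):
    """GRPO-style policy-gradient loss with extra credit concentrated on planning tokens."""
    weights = 1.0 + (PLANNING_BOOST - 1.0) * planning_token_mask(token_ids, strategic_gram_spans)
    # Standard per-token policy gradient, but planning tokens carry more of the signal.
    return -(weights * advantages * logprobs).mean()
```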

Olmo 3.1 32B Think shows long RL runs keep paying off

Allen AI quietly kept the Olmo 3 32B RL job running for three extra weeks and found that reasoning and coding scores kept improving instead of saturating, leading to the new Olmo 3.1 32B Think model. olmo-thread The extended run used roughly 224 GPUs and ~125k H100 hours—about $250k at $2/hour—and still hadn’t clearly hit a plateau, with visible gains on hard benchmarks like AIME and SWE‑style coding tasks. rl-keep-going

All intermediate checkpoints plus the final 3.1 models (Think and 32B Instruct) are fully released, along with millions of RL completions and preference datasets for the community to study and reuse. release-links For anyone planning post‑training budgets, this is a strong data point that large‑scale RL on strong base models can remain in a productive regime far longer than the small 1–2‑week runs many labs standardize on, and that stable infra and evaluation may now matter more than squeezing a bit more pre‑training. It also provides rare transparency on end‑to‑end RL cost versus DeepSeek‑R1‑style efforts, which helps smaller orgs reason about where open RL is actually competitive. cost-audit-summary

OPV verifier framework lifts Olympiad‑level math agent performance

Shanghai AI Lab and collaborators released a long‑horizon math agent that combines a reasoning LLM, a symbolic prover, and an Outcome‑based Process Verifier (OPV) to tackle 50 years of IMO‑style geometry and other Olympiad problems. paper intro OPV scores candidate rationale steps based on whether they eventually lead to a correct outcome, then iteratively improves via rejection fine‑tuning and RLVR, boosting DeepSeek‑R1‑Distill‑Qwen‑32B from about 55.2%→73.3% on AIME25 in the paper’s results while also cutting false positives on synthetic data.

The system shows that you can treat an LLM’s chain‑of‑thought as an object for separate training, using outcome‑grounded process checks rather than only rewarding final answers. For teams experimenting with verifier‑style RL (e.g., grading CoT, training critic models), OPV is a concrete template: maintain an outcome model, a process model, and an active‑learning loop that focuses annotation on the most uncertain reasoning traces. rl-verifier-summary

Decoupled Q‑Chunking scales RL with long critic chunks, short actors

A new RL method called Decoupled Q‑Chunking from Sergey Levine’s group argues that action chunking—standard in imitation learning—can also be made to work for reinforcement learning by separating how the critic and actor see time. method-summary The critic is trained on long chunks to capture delayed credit assignment, while the actor only executes short, reactive chunks, which avoids the usual instability when both try to operate on long sequences.

For continuous‑control benchmarks, the paper reports state‑of‑the‑art results and better wall‑clock efficiency, since the critic can reuse long‑horizon value estimates while the actor sticks to a low‑latency control loop. project page If you’re working on robotics or other control‑heavy agents, the idea is appealingly modular: you can keep your existing short‑horizon policy architecture and swap in a chunked critic, rather than redesigning the whole stack around hierarchical options.
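A toy version of the decoupling, with details simplified from the paper: the critic regresses onto K‑step chunked returns while the actor keeps acting on short, reactive steps:

```python
import numpy as np

K, GAMMA = 8, 0.99   # long chunk for the critic, short horizon for the actor

def chunked_value_targets(rewards, values, k=K, gamma=GAMMA):
    """K-step bootstrapped targets: sum of k discounted rewards plus gamma^k * V(s_{t+k})."""
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        horizon = min(k, T - t)
        ret = sum(gamma**i * rewards[t + i] for i in range(horizon))
        if t + horizon < T:                      # bootstrap unless the chunk hits episode end
            ret += gamma**horizon * values[t + horizon]
        targets[t] = ret
    return targets

# The actor, by contrast, is trained and executed on single-step (or very short) action chunks,
# so control stays reactive while credit assignment uses the longer critic horizon.
```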


🧪 Fresh models: sparse circuits, Devstral in Ollama, mobile UI agent

Smaller but notable releases: OpenAI posts a 0.4B ‘circuit‑sparsity’ model on HF, Mistral’s Devstral 2 family lands in Ollama, and Zhipu’s AutoGLM targets smartphone UI understanding/actions. Voice/TTS models are in the voice section.

Mistral’s Devstral 2 lands in Ollama with 24B and 123B variants

Ollama added Mistral’s new Devstral 2 coding models, so you can now run devstral-small-2 (24B) and devstral-2 (123B) locally with ollama run … or hit the 123B cloud flavor via devstral-2:123b-cloud. ollama launch That makes a top‑tier SWE‑bench performer available in the same workflow as your other GGUF models. devstral card

A community AWQ 4‑bit quantization of the 123B model reports ~400 tokens/s prefill and ~20+ tokens/s generation with a 200k context window on a single GPU, which is strong for repo‑scale agents and terminal bots. awq benchmark Practically, this means you can prototype Devstral‑backed coding agents on commodity hardware, then point the same prompts at Ollama Cloud or Mistral’s own endpoints when you’re ready to scale.
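If you would rather drive it from Python than the CLI, the official ollama package exposes the same models; the model tag comes from the Ollama listing above and the prompt is just an example:

```python
import ollama  # pip install ollama; assumes a local Ollama server with the model pulled

response = ollama.chat(
    model="devstral-small-2",   # or "devstral-2" / "devstral-2:123b-cloud" per the Ollama listing
    messages=[{
        "role": "user",
        "content": "Write a pytest for a function that parses RFC 3339 timestamps.",
    }],
)
print(response["message"]["content"])
```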

Zhipu’s AutoGLM targets smartphone UI understanding and on‑device agents

Zhipu (Z.ai) open‑sourced AutoGLM, a vision‑language model trained to understand smartphone screens and drive them like a human, and exposed it via free APIs on Z.ai plus partners Novita Labs and Parasail. autoglm announcement The model parses UI layouts, text, and icons, then outputs structured actions (taps, scrolls, text input), which is exactly what you need for autonomous mobile agents.

autoglm phone control demo

The release includes weights, docs, and examples for tasks like form filling, app navigation, and multi‑step flows (e.g., “open app, find booking, change date”), so you don’t have to bolt a generic VLM onto brittle template matching. zai product page For AI engineers this is a shortcut to building “operate my phone for me” agents without first collecting your own massive UI dataset or writing a custom screen‑understanding stack.
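For a feel of the consuming side, here is a hypothetical dispatcher that maps structured actions onto adb input commands. The JSON schema is invented for illustration; check Zhipu's docs for AutoGLM's actual output format.

```python
# Hypothetical action dispatcher: turns structured UI actions (tap/scroll/type)
# into adb shell input commands. The action schema here is made up for the sketch.
import json
import subprocess

def dispatch(action_json: str) -> None:
    act = json.loads(action_json)
    kind = act["action"]
    if kind == "tap":
        subprocess.run(["adb", "shell", "input", "tap",
                        str(act["x"]), str(act["y"])], check=True)
    elif kind == "scroll":
        subprocess.run(["adb", "shell", "input", "swipe",
                        str(act["x1"]), str(act["y1"]),
                        str(act["x2"]), str(act["y2"])], check=True)
    elif kind == "type":
        subprocess.run(["adb", "shell", "input", "text", act["text"]], check=True)
    else:
        raise ValueError(f"unknown action: {kind}")

# e.g. dispatch('{"action": "tap", "x": 540, "y": 1830}')
```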

OpenAI drops 0.4B "circuit-sparsity" model on Hugging Face

OpenAI quietly published a 0.4B‑parameter circuit‑sparsity model on Hugging Face, aimed at researchy tasks like bracket counting and variable binding rather than general chat. release note It comes with Apache‑2.0 licensing, example Transformers code, and is small enough to run on a single decent GPU or even CPU for experiments. model card For engineers and interpretability folks this is a handy sandbox: it’s a sparse, toy‑scale model with clear algorithmic probes, so you can study circuits, sparsity patterns, or new training tricks without burning H100 hours. It also gives teams a concrete, OSS reference when they talk about “circuit‑level” behavior instead of only relying on opaque frontier checkpoints. hf repost
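A minimal loading sketch, assuming a standard Transformers‑compatible checkpoint: the repo id below is a placeholder, so take the real id (and any trust_remote_code requirement) from the model card.

```python
# Placeholder loading sketch for the circuit-sparsity model; substitute the real
# repo id from OpenAI's Hugging Face page. Depending on how the architecture is
# registered, from_pretrained may also need trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "openai/circuit-sparsity"   # placeholder id, check the model card
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# Probe one of the toy tasks the card describes, e.g. bracket counting.
inputs = tok("((()))(", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```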


📉 Builder sentiment: speed/cost tradeoffs and model choices

The discourse itself is news: cancellations of ChatGPT Plus, debates over ‘slowness’ at xhigh, and practical model stacks (Opus 4.5 daily driver; GPT‑5.2 for hard audits). Complements the feature by focusing on adoption vibes, not evals.

Builders standardize on Opus 4.5 for speed, GPT‑5.2 for hard audits

Working engineers are converging on a mixed model stack where Anthropic’s Claude Opus 4.5 is the fast daily driver and GPT‑5.2 is reserved for planning and the nastiest bugs. One popular TL;DR recommends Opus 4.5 for routine coding, GPT‑5.2‑high to "plan, audit, fix bugs", and GPT‑5.2 Pro with a repo‑level prompt for the hardest problems coding stack summary.

Another developer’s model pool looks similar: GPT‑5.2 (Thinking) as the most‑used model despite being "painful[ly] slow", Opus 4.5 as a top‑tier but riskier daily driver, a cheaper Composer‑1 for ultra‑fast edits, and Gemini 3 Pro for UI and design suggestions model pool breakdown. Others echo that 5.2 "pushes back harder" and one‑shots complex tasks but is dead slow, so they keep multiple tasks in flight to hide latency slow but capable. A radio‑firmware experimenter reports that GPT‑5.2 spent about an hour trying to design a custom protocol and found more bugs than Opus, but still failed to resolve the underlying hardware issue and ultimately didn’t earn a subscription switch away from Claude hour long attempt sticking with opus. At the extremes, some joke that xhigh "might as well be AGI because by the time it finishes my query, AGI will be here" xhigh slowness joke. For practitioners, this is a clear signal: 5.2 is increasingly the audit/planning specialist in the toolbox, not the default REPL.

Engineers push for cost‑ and latency‑aware evals, not xhigh‑only bragging

Following up on vals cost, where GPT‑5.2’s higher price per query was already a concern, more builders are questioning evals that ignore cost and wall‑clock time. Greg Brockman remarks that "useful lifetime of a benchmark these days is measured in months", which several people quote as they argue that static leaderboards say less and less about real‑world value benchmark lifetime quote benchmark retweet.

One detailed critique notes that ARC‑AGI‑2 scores for GPT‑5.x rise as you crank up reasoning effort, but the cost per task also jumps from $0.73 for GPT‑5‑high to about $1.39 for GPT‑5.2‑high, and even more for xhigh, so "you would expect that performance… would improve with additional test‑time compute" arc cost commentary. Another thread accuses OpenAI of "benchmaxing" with xhigh (~100k thinking tokens) while simultaneously raising list prices by ~40% and argues that METR‑style horizons should always be reported alongside total tokens and elapsed time, not in isolation pricing and xhigh rant. Counter‑voices point out that GPT‑5.2’s big reliability gains—like ~34–47% fewer hallucinations—are the real upgrade and matter more than leaderboard deltas, but they accept that those benefits must be justified against much longer latencies hallucination drop thread. The net effect is growing pressure on labs and independent benchmarkers to treat cost, latency and stability as first‑class metrics, not afterthoughts.

Users start cancelling ChatGPT Plus over GPT‑5.2 ‘vibe’ shift

Some long‑time ChatGPT Plus subscribers are cancelling despite GPT‑5.2’s higher capability, saying the product no longer feels like an upgrade. One user shares the in‑app retention dialog offering a free extra month of Plus if they stay, but still confirms cancellation, writing "It's not the same anymore" and that nerfed responses "don't work" for learning technical topics plus cancel screenshot nerfed response comment.

Another builder notes that Reddit threads describe GPT‑5.2 as "flat", overly safety‑sensitive and "treats adults like preschoolers", with many casual users explicitly missing GPT‑4o’s tone and freedom reddit sentiment reddit typo fix. For teams shipping foundation‑model products, the takeaway is that pushing hard toward enterprise‑grade safety and consistency can create real churn on the consumer side if the overall personality and perceived autonomy step backward, even when raw intelligence goes up.

Multi‑agent handoff hype meets skepticism over token waste and degradation

As multi‑agent orchestration tools compete on features like automatic handoff between models (e.g., Opus for planning, Gemini for frontend, Codex for backend), some builders are pushing back that these patterns mostly burn tokens and often lower answer quality. One engineer notes that while it’s possible to route sub‑tasks between several LLMs, "your thinking tokens are likely gone, output of each model will be worse" once you start chaining them, and complains that multi‑agent harness vendors rarely surface this trade‑off multiagent cost warning.

In a separate reply, the same author snarks about users bragging "I use opus to plan, gemini for the frontend and codex for the rest" and responds with a skeptical "mhmmm", implying that orchestration bragging often outruns measured benefit handoff skepticism. This real‑world sentiment lands shortly after Google’s own research on scaling agent systems concluded that more agents are not always better and that coordination frequently amplifies errors or adds overhead unless the task structure really demands it multiagent study summary multi-agent study. Combined with feedback‑bucket analyses showing that many user complaints come from assistants "not finding the result" or lacking context rather than raw reasoning agent feedback chart, the practical advice is: start with a strong single agent plus solid retrieval, and add specialized sub‑agents only when profiling shows they’re worth the extra tokens and latency.

Open‑source and smaller models gain favor as teams chase 90% cost cuts

Cost pressure is driving a noticeable shift in how builders talk about model strategy. Hugging Face CEO Clément Delangue highlights Pinterest’s report that it saved 90% on AI model costs by leaning on open‑source systems instead of "brand‑name" proprietary APIs, and argues that we’re at the peak of pure proprietary API dominance, with attention and revenue about to rebalance toward open weights and custom training open source savings peak proprietary claim.

Some individual practitioners echo this framing when they look at GPT‑5.2’s 40% list‑price increase and the heavy token burn of xhigh reasoning, questioning whether marginal capability gains justify the spend when smaller or open models can handle a large slice of everyday workloads pricing frustration. Others point to fast, CPU‑friendly embedding models like Luxical‑One that deliver ~97× Qwen GPU throughput for retrieval as concrete examples of how stepping off the frontier can save money without wrecking quality for a given task luxical release. The underlying sentiment is pragmatic: use frontier systems where they win clearly (hard reasoning, long‑context agents), but aggressively push everything else down to cheaper, often open, components to keep unit economics sane.

Builders tire of ‘X is finished’ lab wars and focus on fit‑for‑purpose models

A chunk of AI Twitter is openly annoyed at zero‑sum lab fandom, with one engineer calling out how every Google drop gets spun as "openAI is finished", only for the script to flip to "google is finished" a week later when OpenAI ships something, and labeling both sides of the shilling "pathetic" fanboy complaint. Others say this is missing the plot: models are diverging in feel and strengths over time, which is good, because it lets teams match specific jobs—coding, research, multimodal UI—to the model that fits, rather than pretending there’s a single universal winner model divergence comment.

Ethan Mollick adds that frontier models across US labs and Chinese/French open‑weight releases are still surprisingly similar in overall ability and prompt adherence, so choosing between them is less about abstract IQ and more about pricing, latency, tools, and safety trade‑offs frontier similarity note. Another commenter backs Demis Hassabis’s stance that scaling is still paying off and probably won’t magically stop at some near‑term wall, but warns that whether a given bump in capability is worth adopting depends on economics and workflow, not fan loyalty scaling still useful. For practitioners, the mood is shifting away from "who’s ahead" arguments and toward portfolio thinking: pick what works today, swap it out next quarter if the mix of speed, cost and reliability changes.


🤖 Embodied stacks: video world models, monocular mocap, care robots

Robotics items cluster around sim‑to‑real and motion: DeepMind evaluates Veo‑based world models vs ALOHA‑2 trials (r=0.88), MoCapAnything tracks 3D skeletons from monocular video, and RobotGym highlights assistive care concepts.

DeepMind validates Veo-based world models as reliable stand‑in for robot trials

DeepMind released a robotics paper showing that policies evaluated in a Veo-based video world simulator correlate strongly (r ≈ 0.88) with real‑world success on over 1,600 ALOHA‑2 bimanual robot trials, letting teams train and select policies without burning hardware time first robotics thread. This pushes simulation‑to‑real from “rough proxy” toward a quantitative, benchmarked tool for policy search and regression testing, and follows earlier work on more open‑ended robot reasoning open-ended agents.

The key result: a single Veo‑powered world model can predict which policies will succeed on real hardware across a wide set of tasks and even out‑of‑distribution conditions, instead of needing per‑task simulators ArXiv paper. For AI engineers, this means you can iterate policies, curriculum schedules, and reward shaping in simulation with some confidence that gains will transfer, then reserve expensive ALOHA‑class runs for final verification and fine‑tuning. For infra leads, it’s an early data point that investing in video‑trained world models can pay off as an evaluation primitive, not just for pretty generative demos. The interesting open question is how far this approach generalizes beyond table‑top bimanual manipulation into mobile and human‑interactive settings, where model errors and safety margins matter even more.
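As a toy illustration of the "evaluation primitive" idea, with entirely synthetic numbers: rank candidate policies by world‑model success rate, send only the top few to hardware, and track sim‑to‑real agreement via Pearson r.

```python
# Toy illustration (synthetic numbers) of using a world-model evaluator as a
# policy-selection primitive; nothing here reproduces DeepMind's setup.
import numpy as np

rng = np.random.default_rng(0)
sim_success = rng.uniform(0.2, 0.9, size=20)                        # world-model eval per policy
real_success = np.clip(sim_success + rng.normal(0, 0.08, 20), 0, 1)  # pretend hardware results

r = np.corrcoef(sim_success, real_success)[0, 1]
top_k = np.argsort(sim_success)[::-1][:3]   # policies worth real ALOHA time
print(f"Pearson r = {r:.2f}; verify policies {top_k.tolist()} on hardware")
```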

MoCapAnything offers unified 3D motion capture from a single RGB video

The MoCapAnything project introduced a unified 3D motion capture system that recovers arbitrary skeletons from monocular videos, with demos of walking, jumping and other actions side‑by‑side with the reconstructed 3D pose model announcement. For robotics and graphics teams, this kind of monocular mocap turns cheap phone footage into structured motion data that can drive humanoid controllers, animation rigs or imitation‑learning pipelines without multi‑camera stages or markers.

monocular mocap demo

Because it supports arbitrary skeletons rather than a single human template, you can in principle retarget captured motion to different robot morphologies or game characters without bespoke per‑rig tooling. That lowers the barrier to collecting diverse motion priors for legged robots, manipulators or hybrid forms, and makes it easier to build datasets where the same behavior is expressed across multiple embodiments. The next step for practitioners will be stress‑testing it on non‑lab footage (occlusions, clutter, motion blur) and checking whether the 3D estimates are clean enough for stable control when fed into model‑based RL or trajectory optimization stacks.

RobotGym’s Qijia Q1 sketches a dual wheelchair–humanoid robot for elder care

RobotGym’s Qijia Q1 concept shows a robot that doubles as a powered wheelchair and mobile helper for elderly users, handling mobility while also manipulating objects like food in a microwave robotgym clip. The demo is less about polished autonomy and more about articulating the motion, compliance, and human‑interaction requirements if you want a single platform that can both carry a person and work around them safely in tight domestic spaces.

elderly care robot

For embodied‑AI builders, this is a concrete target domain that stresses the whole stack: robust navigation in clutter, safe close‑contact manipulation, intent recognition from speech and gesture, and fail‑safe behavior when something goes wrong concept summary. It also highlights why simulation and world models will matter here: you can’t cheaply iterate on failure modes when a fall or collision injures a frail user, so you need high‑fidelity virtual training and testbeds before deploying new behaviors on hardware. Teams thinking about care robots can use Qijia‑style concepts as a design foil—map out what sensors, actuation, and policy abstractions you’d need to cover real ADLs (activities of daily living), then decide which parts are tractable with today’s vision‑language‑action models and which still need bespoke control.


🛡️ Policy & IP: one rulebook push and content licensing friction

Policy front tightens: a White House EO seeks a single national AI framework (DOJ task force, funding levers), and Disney sends Google a cease‑and‑desist over alleged Gemini IP infringement. Excludes any GPT‑5.2 system card items (feature).

US executive order moves to preempt state AI laws with one national rulebook

The White House has now signed an AI executive order that directs the DOJ to form a task force within 30 days to challenge state AI regulations and pushes Commerce to publish a list of “onerous” state AI laws within 90 days, aiming for a single federal framework for AI governance policy summary. This follows the earlier signal toward a unified regime national framework and explicitly leans on interstate commerce arguments to override state‑by‑state rules, with TechCrunch noting that startups may face a legal limbo while court challenges play out policy summary, white house order.

For AI builders, this means compliance strategies built around the strictest state rules (like deepfake labels or model registry mandates) may get rewritten in the medium term, but the order does not instantly void those state laws—it sets up litigation and federal standards that could take years to stabilize. Smaller teams selling into regulated sectors (finance, healthcare, legal) are especially exposed: they may still need to honor current state requirements while also preparing for a more centralized, possibly more scale‑friendly, federal rulebook that Washington is trying to force through.

Disney’s cease‑and‑desist accuses Google Gemini of massive‑scale IP misuse

Disney has sent Google a cease‑and‑desist letter accusing it of using Disney, Pixar, Marvel and Star Wars IP at “massive scale” to train Gemini and then promoting infringing outputs, including a Gemini‑driven “figurine” image trend that allegedly featured unlicensed Disney characters cnd summary. Coming the same week as Disney’s $1B equity and licensing deal with OpenAI and Sora, this escalates the split strategy noted earlier—licensing its catalog to one lab while threatening legal action against another disney letter, techcrunch article.

If Disney pushes this beyond demand letters, it could become one of the first big test cases on whether training and actively marketing branded generations fall inside US fair use, and it will make other model providers think harder about trending prompts that center specific franchises. For AI leaders, the risk isn’t only training data: front‑of‑house features (prebuilt prompts, social trends, or branded "figurine" workflows) that encourage users to crank out recognizable IP are now very much in the litigation blast radius.


🧭 Search & embeddings: CPU‑fast lexical embeddings, eval hooks

A CPU‑first turn for embeddings: Luxical approximates transformer embeddings using lexical features + tiny ReLU nets, hitting 6,803 docs/s on CPU (~97× Qwen). Integrations for Sentence Transformers aim to bring quick MTEB checks.

Luxical-One debuts as CPU-fast lexical embedding model ~97× Qwen throughput

DatologyAI released Luxical-One, a lexical+tiny-MLP embedding model that approximates transformer embeddings but runs at ~6,803 docs/s and ~23 MiB/s on an M4 Max CPU, about 97× the throughput of a Qwen3‑0.6B GPU baseline on their benchmark setup. luxical announcement The model uses TF‑IDF-style lexical features fed into a small ReLU network trained contrastively, giving transformer-like semantic behavior while staying entirely CPU-friendly. throughput numbers

For builders this means document similarity, clustering, and classification pipelines that used to require GPUs can be moved to cheap CPUs or laptops without a huge quality hit, freeing GPU budget for generation models. blog post Luxical-One is published under Apache‑2.0 on Hugging Face with simple transformers-style loading code and examples for immediate use in text search and analytics stacks. model card Analysts like @tomaarsen argue this fills an underappreciated niche between static TF‑IDF and full attention-based encoders, especially where throughput, latency, or cost are the primary bottlenecks. embedding overview
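The general recipe is simple enough to sketch: hashed lexical features feeding a tiny ReLU MLP, trained contrastively. The dimensions, hashing vectorizer, and InfoNCE‑style loss below are assumptions for illustration, not Luxical's actual training code.

```python
# Illustrative recipe: hashed TF-IDF-style features -> small ReLU MLP -> dense
# embedding, trained with an in-batch contrastive loss. All sizes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.feature_extraction.text import HashingVectorizer

N_FEATURES, EMB_DIM = 2**15, 256
vectorizer = HashingVectorizer(n_features=N_FEATURES, alternate_sign=False, norm="l2")

class LexicalEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(N_FEATURES, 512), nn.ReLU(),
            nn.Linear(512, EMB_DIM))
    def forward(self, texts):
        x = torch.tensor(vectorizer.transform(texts).toarray(), dtype=torch.float32)
        return F.normalize(self.mlp(x), dim=-1)

def contrastive_loss(a, b, temp=0.05):
    """InfoNCE over in-batch negatives: row i of `a` should match row i of `b`."""
    logits = a @ b.T / temp
    labels = torch.arange(a.shape[0])
    return F.cross_entropy(logits, labels)

enc = LexicalEncoder()
first_halves = ["cheap cpu embeddings for search", "token-level sparsity circuits"]
second_halves = ["fast lexical models on laptops", "interpretability of sparse nets"]
loss = contrastive_loss(enc(first_halves), enc(second_halves))
```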

Luxical-One gains Sentence Transformers integration and path to MTEB

Community contributors wired Luxical-One into the Sentence Transformers ecosystem, so it can now be used via the familiar SentenceTransformer API and dropped into standard retrieval and clustering workflows. hf integration note This integration means Luxical’s CPU-speed embeddings can soon be evaluated on MTEB and other common benchmarks, not just Datology’s in-house doc-half-matching tasks. retrieval question The current Luxical evals focus on matching two halves of the same document, where the model already performs strongly, but @tomaarsen notes that retrieval benchmarks are the real test for LLM-era use cases and is preparing MTEB runs using the new adapter. retrieval plans For practitioners, the update turns Luxical-One from an interesting standalone demo into something you can swap into existing ST-based pipelines with minimal code changes, then directly compare speed/quality trade-offs against MiniLM, E5, or Qwen encoders using your existing eval harnesses. hf discussion
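Once the adapter lands, usage should look like any other Sentence Transformers model; the model id below is a placeholder pending the official integration, and a custom architecture may require trust_remote_code.

```python
# Usage sketch via the Sentence Transformers API; the repo id is a placeholder,
# take the real one from the Hugging Face integration thread (and pass
# trust_remote_code=True if the custom architecture needs it).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("datologyai/luxical-one", device="cpu")  # placeholder id
docs = ["invoice processing runbook", "quarterly revenue recognition policy"]
emb = model.encode(docs, normalize_embeddings=True)
print(emb.shape, emb @ emb.T)  # cosine similarities, since embeddings are normalized
```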
