
OpenAI GPT‑5.2 multiplies tokens 6× on agents – 1T‑token debut
Executive Summary
Two days after OpenAI’s GPT‑5.2 launch, independent evals are filling in the fine print. On Artificial Analysis’ GDPval‑AA run, GPT‑5.2 xhigh tops the agentic leaderboard at 1474 Elo, edging Claude Opus 4.5 at 1413, but it burns ~250M tokens and ~$610 across 220 tasks, more than 6× GPT‑5.1’s usage.
ARC‑AGI‑2 tells the same story: GPT‑5.2 high reaches ~52.9% at ~$1.39 per problem vs GPT‑5’s 10% at $0.73, and xhigh and Pro runs go higher still. Meanwhile on SimpleBench’s trick questions, GPT‑5.2 base scores 45.8% and Pro 57.4%, trailing GPT‑5 Pro and far behind Gemini 3 Pro at 76.4%, reinforcing that “more thinking” mainly helps long, structured work.
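One way to read the ARC‑AGI‑2 numbers above is cost per *solved* problem rather than cost per attempt. A quick back-of-envelope sketch (illustrative arithmetic only; `cost_per_solve` is not an official metric, just per-problem price divided by accuracy):

```python
# Cost per correct ARC-AGI-2 answer, derived from the per-problem
# figures quoted above. Purely illustrative arithmetic.

def cost_per_solve(usd_per_problem: float, accuracy: float) -> float:
    """USD spent per correctly solved problem at a given accuracy."""
    return usd_per_problem / accuracy

print(f"GPT-5.2 high: ${cost_per_solve(1.39, 0.529):.2f} per solved problem")
print(f"GPT-5:        ${cost_per_solve(0.73, 0.10):.2f} per solved problem")
# → ~$2.63 vs ~$7.30: the pricier model is cheaper per correct answer
```

By this framing, the higher per-problem spend buys enough extra accuracy that GPT‑5.2 high actually undercuts GPT‑5 on cost per solve.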
Demand doesn’t seem deterred: Sam Altman says GPT‑5.2 cleared 1T API tokens on day one despite roughly 40% higher list prices. VendingBench‑2 and Epoch’s ECI put it in the same long‑horizon league as Gemini 3 Pro and Opus 4.5 rather than making it a runaway winner. The practical takeaway: treat xhigh reasoning as a surgical tool for quarterly plans, audits, and gnarly debugging, and let cheaper, faster models handle your inner loops.
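To put the GDPval‑AA spend in per-task terms, a minimal sketch using the figures reported above (~$610 and ~250M tokens over 220 tasks; the helper function is ours, not part of any eval harness):

```python
# Back-of-envelope per-task economics for the GDPval-AA run discussed above.
# Inputs come from the reported eval; the function itself is illustrative.

def per_task(total_usd: float, total_tokens: float, tasks: int) -> tuple[float, float]:
    """Return (USD per task, tokens per task)."""
    return total_usd / tasks, total_tokens / tasks

usd_per_task, tokens_per_task = per_task(610.0, 250e6, 220)
print(f"~${usd_per_task:.2f} and ~{tokens_per_task / 1e6:.2f}M tokens per task")
# → ~$2.77 and ~1.14M tokens per task
```

Roughly a million tokens and a few dollars per agentic task is the budget line teams are actually weighing against the leaderboard position.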
Top links today
- GPT-5.2 long-context capabilities overview
- OpenAI circuit sparsity research post
- Stirrup agentic harness and GDPval tools
- Building Sora Android app with Codex
- Tinker GA launch and model support
- Olmo 3.1 Think 32B model artifacts
- Olmo 3 and 3.1 technical paper
- VoxCPM open-source TTS model card
- Luxical-one fast static embeddings blog
- AutoGLM smartphone UI agent model
- Gemini 2.5 Flash Native Audio update
- HICRA hierarchy-aware RL for LLM reasoning
- Asynchronous Reasoning training-free thinking LLMs
- InternGeometry olympiad geometry agent paper
- Causal-HalBench LVLM object hallucinations
Feature Spotlight
Feature: GPT‑5.2 reality check—evals vs cost and latency
GPT‑5.2 leads GDPval‑AA but burns ~250M tokens and ~$610 per run; it underperforms on SimpleBench and is slow at xhigh. Engineers must weigh accuracy gains against cost and latency in real agent workflows.
The day‑two picture for GPT‑5.2: strong agentic task wins but mixed third‑party benchmarks and heavy token spend. Threads focus on the SimpleBench underperformance, GDPval‑AA costs, and the heavy xhigh reasoning latency behind its ARC‑AGI‑style gains.