
GPT‑5.2‑Codex hits 56.4% SWE‑Bench Pro – gated cyber access
Executive Summary
OpenAI’s week is about agents, not chat: GPT‑5.2‑Codex is now the default coding brain in Codex and paid ChatGPT, tuned for long‑horizon refactors, tighter Windows/tool use, and context compaction so multi‑hour sessions don’t blow your token budget. On OpenAI’s own numbers it edges GPT‑5.2 and 5.1‑Codex‑Max on real‑world suites, hitting 56.4% on SWE‑Bench Pro and 64.0% on Terminal‑Bench 2.0, which map much closer to “did this actually fix the repo?” than to toy LeetCode.
The twist is cyber. 5.2‑Codex sits at the top of their internal professional CTF evals with pass@12 clustered around 85–90%, so OpenAI is turning it on inside Codex but slow‑rolling API access and spinning up an invite‑only “trusted access” track for defensive teams. That follows a recent CVE‑2025‑55183 React exploit that a researcher co‑developed with 5.1‑Codex‑Max, and it’s clear they don’t want that workflow in every random pentest bot.
Codex CLI 0.74 makes 5.2‑Codex the default with per‑run reasoning effort knobs and a configurable sandbox; devs report “medium” effort covering ~85% of work and x‑high rescuing gnarly bugs Opus 4.5 couldn’t crack in an hour. But 5.2‑Codex underperforms 5.1‑Codex‑Max on MLE‑Bench‑30, and repo hints of a “caribou” flagship suggest a 5.2‑Codex‑Max tier is already brewing. Treat this as your new baseline model—then keep workload‑specific evals and routing in the loop.
Top links today
- OpenAI GPT-5.2-Codex launch and system card
- Agent Skills open standard overview
- Agent Skills open standard GitHub repository
- Gemini 3 Flash in Google AI Studio
- Mistral OCR 3 technical blog and API
- LlamaParse v2 document parsing release blog
- Audio MultiChallenge benchmark paper for speech agents
- Audio MultiChallenge spoken dialogue leaderboard
- Arena-Rank paired-comparison ranking library repo
- SGLang Ollama-compatible API quickstart guide
- OpenRouter JSON response healing feature blog
- vLLM wide expert-parallel MoE benchmark
- Braintrust Brainstore AI observability database
- Exa People Search evals GitHub repository
- Google Agent Development Kit for TypeScript agents
Feature Spotlight
Feature: GPT‑5.2‑Codex lands for agentic coding and cyber
OpenAI ships GPT‑5.2‑Codex in Codex: SOTA agentic coding with 56.4% SWE‑Bench Pro and 64.0% Terminal‑Bench 2.0, context compaction for long runs, stronger Windows/vision, and an invite‑only cyber program.
Cross‑account launch with docs, CLIs and early dev reports. Focus is long‑horizon coding (context compaction), better Windows/tool use, vision‑aided code reading, and a new trusted‑access path for defensive cybersecurity.
🧠 Feature: GPT‑5.2‑Codex lands for agentic coding and cyber
Cross‑account launch with docs, CLIs and early dev reports. Focus is long‑horizon coding (context compaction), better Windows/tool use, vision‑aided code reading, and a new trusted‑access path for defensive cybersecurity.
OpenAI turns on GPT-5.2‑Codex for long‑horizon agentic coding
OpenAI has launched GPT‑5.2‑Codex as the new default agentic coding model inside Codex for all paid ChatGPT users, positioning it as their best option so far for complex, long‑running software work and defensive security tasks, with API access promised “soon.” launch tweet dev launch
The model is a GPT‑5.2 variant tuned specifically for coding and terminal use, adding native context compaction so multi‑hour sessions don’t blow past context limits, stronger long‑context reasoning, better Windows support, and improved tool‑calling for shells, editors, and test runners. gdb launch note This also extends to vision: GPT‑5.2‑Codex can parse screenshots, diagrams, and UI layouts more reliably, which matters for folks who debug from error dialogs or design mocks. OpenAI blog post In practice, you invoke it via the Codex CLI as codex -m gpt-5.2-codex, or pick it in the Codex model menu, where it’s now the top option and wired to multiple reasoning‑effort levels (low→x‑high) for different workloads. (cli usage, cli picker screenshot) For now it’s strictly available through Codex and ChatGPT for paying users, with OpenAI explicitly saying they’ll roll the API out more slowly than prior Codex models due to the model’s higher cybersecurity capability. (cli usage, cyber capability)
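API access isn't live yet, but if it lands on the standard Responses surface, a call would look roughly like the sketch below. The model id is the one the CLI already uses; the endpoint availability and accepted effort values are assumptions until OpenAI publishes API docs.

```python
# Hypothetical sketch: gpt-5.2-codex over the public API once it rolls out.
# The Responses client and reasoning-effort parameter exist today for other
# reasoning models; the model id and its API availability are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.responses.create(
    model="gpt-5.2-codex",         # same id the CLI uses via `codex -m gpt-5.2-codex`
    reasoning={"effort": "high"},  # API efforts are low/medium/high; "x-high" is a Codex picker notion
    input="Refactor the retry logic in client.py to use exponential backoff with jitter.",
)
print(resp.output_text)
```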
GPT‑5.2‑Codex posts SOTA on SWE‑Bench Pro and Terminal‑Bench 2.0
OpenAI’s launch benchmarks show GPT‑5.2‑Codex edging out GPT‑5.2 and GPT‑5.1‑Codex‑Max on real‑world coding suites, with 56.4% on SWE‑Bench Pro (up from 55.6% for GPT‑5.2 and 50.8% for GPT‑5.1‑Codex‑Max) and 64.0% on Terminal‑Bench 2.0 (vs 62.2% and 58.1%). benchmarks tweet release page
Those numbers matter because both benchmarks measure end‑to‑end behavior: SWE‑Bench Pro scores whether the patch actually fixes the bug in a live repo, and Terminal‑Bench 2.0 scores whether an agent can drive a realistic shell session through compiles, migrations, and service management, not just emit code snippets. For teams already running 5.1‑Codex‑Max agents, the gains are incremental rather than explosive, but they land on exactly the painful edge cases: larger refactors and longer, more stateful terminal sequences. benchmarks summary There are also early signs that performance is not uniformly better. On an MLE‑Bench‑30 chart, GPT‑5.2‑Codex scored 10% pass@1 on tuning/optimization problems, below 17% for GPT‑5.1‑Codex‑Max and even the 16% of GPT‑5.2‑thinking.
mle-bench tweet And a scenario‑based CTF table shows mixed results across specialized security challenges, with GPT‑5.2‑thinking failing more scenarios than 5.1‑Codex‑Max while 5.2‑Codex recovers some but not all of that loss. scenario table For AI leads, this all says: treat GPT‑5.2‑Codex as the new default starting point, but keep per‑workload evals in your loop instead of assuming it’s strictly dominant.
GPT‑5.2‑Codex pushes cyber skills, OpenAI adds trusted‑access track
OpenAI says GPT‑5.2‑Codex is “more cyber‑capable” than GPT‑5.1‑Codex‑Max and now sits at the top of their internal professional CTF evals, with pass@12 accuracy clustered near 85–90% on advanced multi‑step challenges in a Linux environment. (ctf chart summary, cyber capability)
The company is explicitly tying that capability to real‑world outcomes: a security researcher using GPT‑5.1‑Codex‑Max plus the Codex CLI recently helped uncover and responsibly disclose CVE‑2025‑55183, a React Server Components vulnerability that could expose source code and cause denial‑of‑service, by having the model scan a repo, propose attack surfaces, build a harness, then fuzz and refine until the exploit stuck. (ctf chart summary, workflow summary) Because the same skills can be misused, OpenAI is splitting things into two tracks. Regular paid users get GPT‑5.2‑Codex in Codex today, benefiting from the stronger bug‑finding and hardening behavior in everyday work, while a separate invite‑only “trusted access” program will expose the most powerful cyber workflows to vetted defensive teams and researchers, particularly those inside enterprises and open‑source communities. (trusted access note, rollout details) For security engineers and CISOs, the message is that Codex is now a serious co‑analyst for vuln research and hardening pipelines—but getting the full power will require going through an onboarding and accountability process rather than just flipping a flag.
Codex CLI 0.74 makes GPT‑5.2‑Codex the default and exposes sandbox tuning
Codex 0.74 has shipped with GPT‑5.2‑Codex as the new default model in the CLI, along with a proper “Select Model and Effort” picker that lets you choose between low, medium, high, and extra‑high reasoning effort for each run. cli update
From a workflow point of view, that means you can stay on gpt-5.2-codex but dial up x‑high for multi‑hour refactors, then flip to low or medium for quick edits or test fixes without changing the model ID. People are already doing exactly that, with one engineer noting they now use medium effort “85% of the time” for speed, reserving x‑high for the genuinely gnarly tasks. (cli usage, reasoning effort usage) On the harness side, Codex’s sandbox remains intentionally strict by default, which has frustrated some users who find xcodebuild or other heavy commands blocked even though they’re core to their daily loop. sandbox complaint To address that, OpenAI now documents a local-config rules file where you can loosen or tighten sandbox permissions per project, giving you a way out of “xcodebuild hell” without turning Codex loose on your entire machine. (sandbox customization, sandbox docs) For infra owners, the take‑away is that Codex is treating model choice and execution policy as first‑class configuration knobs; you’ll want to standardize those per team the same way you do linting or CI settings.
Builders report GPT‑5.2‑Codex fixing bugs faster with a terser personality
Early users describe GPT‑5.2‑Codex as unusually terse and willing to push back, but also as the first model that reliably solves some long‑standing bugs and refactors that other frontier LLMs struggled with. (personality comment, dev sentiment)
One OpenAI engineer highlighted that they “like the terseness” and showed a playful exchange where the model insists its original emoji count was correct and then replies “Count them next time, bestie.” personality comment That tone might be annoying for some, but it maps to a deeper behavior: the model is more confident about its intermediate reasoning and will argue when you miscount or mis‑specify something, which can be helpful when you’re staring at a subtle logic bug.
More importantly, there are concrete wins. In one report, a 5.2 xHigh Codex run fixed a nasty streaming bug in an iOS transcript/token counting pipeline in about five minutes—after Opus 4.5 had spent an hour trying and failing to land a working patch.
bugfix anecdote Other developers are using 5.2‑Codex to one‑shot risky changes like swapping xterm for libghostty in a production TUI, describing the result as “works… perfect? lol” and calling the model “ridiculously good.” terminal migration short launch note
People are also learning to lean on the new reasoning‑effort controls. One practitioner says medium reasoning feels as smart as high for most work while being much faster, and that low effort gives “about 90% as good” results at a huge latency savings, which lines up with the idea that you should reserve x‑high for only the most complex or security‑sensitive edits. reasoning effort usage For AI leads rolling Codex out to teams, the pattern emerging is: standardize on 5.2‑Codex, teach engineers when to bump effort, and expect a more opinionated partner than prior models.
Caribou presets hint at a GPT‑5.2‑Codex‑Max follow‑on
GitHub activity and config snippets from OpenAI’s Codex repo show a new caribou model preset being wired in, described as the “latest Codex‑optimized flagship for deep and fast reasoning,” strongly suggesting a forthcoming GPT‑5.2‑Codex‑Max‑style variant. (caribou pr, config snippet)
The preset defines caribou with a default medium reasoning effort and support for low, high, and x‑high modes, mirroring how 5.1‑Codex‑Max is exposed today but with more explicit descriptions like “extra high reasoning depth for complex problems.”
That’s led a number of observers to argue that “caribou” is the internal codename for a next‑step Codex model tuned even more aggressively for long‑horizon work, effectively a 5.2‑Codex‑Max.
Commentators tracking the repo note that this code landed within hours of the public 5.2‑Codex launch, speculating that “the next iterative release is on the horizon: GPT‑5.2‑codex‑max, project name ‘Caribou’,” and wondering out loud whether 5.3 will even fit into the 2025 calendar at this pace. (release speculation, caribou branch spotted) A write‑up by TestingCatalog walks through the PR and reinforces that this is more than a refactor: it’s a dedicated preset with its own description and reasoning settings, not just an alias. testingcatalog blog For engineering leaders, the important part isn’t the name so much as the pattern: OpenAI is clearly carving out a two‑tier Codex story again—a more general 5.2‑Codex that most people touch, and a “flagship” profile with higher reasoning budgets that will likely come with tighter controls and higher costs. Planning your harnesses and routing logic around that split now will make it easier to adopt Caribou‑class models once they surface in the public API.
🛠️ Coding agents and dev tooling in practice
Non‑OpenAI updates to daily coding stacks: Claude Code’s browser control, stabilization rollbacks, repo agents and review flows. Excludes the GPT‑5.2‑Codex launch (covered in the feature).
RepoPrompt 1.5.57 adds full CLI and MCP slash commands for repo-scale agents
RepoPrompt shipped v1.5.57 as its last big 2025 release, adding a fully featured CLI that agents can call without the MCP server, plus MCP slash commands that auto‑surface structured workflows like rp-build and rp-investigate in compatible clients. It also explicitly supports the latest coding models (GPT‑5.2‑Codex and Gemini 3 Flash) for discovery and implementation loops. Following repo agent, which emphasized research→plan→execute patterns, this release is about making those flows easier to wire into any harness. release thread
The new CLI lets you run discovery, planning, and implementation steps from any shell, which is handy when your main editor doesn’t yet speak MCP or when you want agents to orchestrate RepoPrompt as a sub‑tool. The MCP slash commands mean Claude Code, Codex, and other ACP/MCP‑aware clients can expose opinionated prompts like rp-build directly in their UIs, turning “fix this feature” into a guided multi‑step workflow instead of ad‑hoc chat. slash commands note On top of that, the Context Builder continues to focus on high‑signal retrieval by having agents explore the repo, refine the task description, and assemble dense, ranked context windows—reducing the usual slop of dumping half the codebase into the model. context builder screenshot , docs update If you’re already using RepoPrompt as a planning brain, this release makes it much easier to share that brain across multiple agents and environments.
Amp editor adds review agent as teams call code review the new bottleneck
The Amp editor extension now ships with a dedicated review agent aimed at tackling what Sourcegraph’s team calls the bottleneck in agentic coding: getting trustworthy feedback on big diffs instead of dumping untested AI patches on colleagues. review agent note
The idea is that your coding agent can do the bulk of implementation, and Amp’s review agent then analyzes the change—reasoning about async behavior, streaming bugs, hangs, and other subtle issues—before you hit “merge.” In one shared example, a Codex‑backed run at extra‑high reasoning effort took five minutes to locate and fix a thorny async bug Opus 4.5 couldn’t solve in over an hour. bugfix example That’s exactly the kind of “slow but smart” pass you want a machine doing, not a human sweating over traces. Timely feedback matters here: one engineer says he now uses this agent for “~80% of my diffs,” which tells you people are comfortable letting it be the first reviewer before they step in for final judgment. usage comment For teams leaning into agentic coding, this is a concrete way to shift human time from raw code review to higher‑level design and risk checks.
Toad becomes a universal ACP terminal for Claude, Codex, Gemini and more
Will McGugan launched Toad, a terminal UI that speaks the Agent Client Protocol (ACP) and can sit in front of multiple AI coding agents—OpenHands, Claude Code, Gemini CLI, Codex and others—so you get one consistent interface instead of juggling a dozen bespoke CLIs. launch post

Toad treats AI agents like TV channels: you install and configure them once, then flip between them inside a single, keyboard‑driven terminal app, without each vendor double‑charging you for tokens or forcing separate UIs. It already supports slash‑style commands, session views, and logs; the roadmap calls out sessions, multi‑agent support, a model picker (once ACP standardizes it), and even a UI for managing MCP servers, which would make it a genuine hub for agent tooling. launch post , feature tease Early users are enthusiastic about the ergonomics (“beautiful terminal interface”), and Hugging Face is sweetening the deal with $10 in Inference credits to experiment with open models through Toad. early review , HF credit promo If you’re feeling overwhelmed by separate CLIs for Claude, Codex, Gemini, etc., this is an appealing way to consolidate without giving up flexibility.
LangChain’s Deepagent-CLI leans on reflection to update agent memory over time
LangChain is sketching out how its Deepagent-CLI should handle long‑term memory: either via direct instruction (“remember this rule”) or by reflecting over past sessions and writing new rules into files like AGENT.md and CLAUDE.md. A diagram from the team shows a loop where a code agent runs, LangSmith traces are collected, a reflection pass distills patterns, and the resulting memories are written back into the agent’s config. memory loop diagram
The goal is to make memory a first‑class primitive rather than an ad‑hoc blob in a vector store. In this design, agents are explicitly told what they’ve learned (new rules, preferences, gotchas), and those changes are version‑controlled alongside code, so you can understand why an agent behaves differently after a week of use. memory overview For coding agents in particular, this approach lets you encode things like “never run xcodebuild in this sandbox,” “always use our logging helper,” or “prefer this testing pattern” as durable, inspectable instructions instead of hoping RAG fetches the right internal doc at the right time. It’s still early, but it’s a concrete blueprint for making agent behavior more debuggable and less mystical over months of use.
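The write‑back half of that loop is easy to prototype even outside LangChain's stack. Below is a minimal, illustrative sketch (not Deepagent‑CLI's actual code) of the reflect‑then‑persist step: feed recent session logs to a model, distill durable rules, and append them to AGENT.md so they live in version control next to the code.

```python
# Illustrative reflect-and-persist step, not LangChain's Deepagent-CLI internals.
# Assumes an OpenAI-compatible chat endpoint; the prompt, model id, and file
# layout are placeholders.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def reflect_into_memory(session_logs: list[str], memory_file: str = "AGENT.md") -> None:
    """Distill durable rules from past agent sessions and append them to AGENT.md."""
    prompt = (
        "You maintain a coding agent's long-term memory file. From the session logs "
        "below, extract at most five durable rules (preferences, gotchas, project "
        "conventions) as markdown bullets. Skip anything session-specific.\n\n"
        + "\n\n---\n\n".join(session_logs)
    )
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
    )
    rules = resp.choices[0].message.content.strip()
    path = Path(memory_file)
    existing = path.read_text() if path.exists() else "# Agent memory\n"
    path.write_text(existing.rstrip() + "\n\n## Learned rules\n" + rules + "\n")
```

Because the rules land in a plain file, a reviewer can diff what the agent "learned" each week, which is exactly the debuggability this design is after.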
Notte’s Demonstrate Mode turns recorded browser sessions into automation code
Notte introduced a "Demonstrate Mode" that records your actions in the browser—clicks, scrolls, form fills—and then converts that trace into executable automation code, instead of making you describe the workflow in a long natural‑language prompt. One early user calls it "the perfect tool" for making precise workflows that prompts routinely mangle. feature summary The key idea is that some tasks are much easier to demonstrate than to specify: multi‑step CRUD flows, dashboards with odd DOM structures, login gates, or sites where selector stability really matters. Demonstrate Mode captures that behavior once and gives you editable code you can tweak, reuse, or plug into a larger agent pipeline, which is far more dependable than having an LLM hallucinate a Playwright script from scratch. feature summary For coding agents, this is a natural companion: let the agent call into these recorded scripts for brittle UI work, while it focuses its reasoning on higher‑level orchestration and error handling.
Warp adds Python and Node environment chips to cut debugging friction
Warp shipped new "environment chips" that show your active Python virtualenv path or Node version directly in the terminal UI, with a quick toggle to switch Node versions via nvm. It’s a small quality‑of‑life update, but one that removes a common source of "why does it work in my shell but not in CI" confusion. feature demo

For Python projects, the chip surfaces the full path to the .venv directory, so you can instantly confirm whether you’re running against the project’s local environment or some system/global install. For JavaScript, it shows the current Node version and lets you flip versions inline—helpful when you’re bouncing between projects on different runtimes or reproducing a bug on an older LTS. feature demo For AI‑heavy stacks where you juggle GPU‑enabled Python envs and multiple Node CLIs (CLI agents, dev servers, dashboards), having this ambient context in the terminal reduces the back‑and‑forth "which env is this?" conversations and makes reproducing agent failures much easier.
WarpGrep positions itself as a specialized MCP search agent for Claude Code
Morph’s WarpGrep project is being wired into Claude Code via MCP, turning it into a high‑throughput, code‑aware search subagent that coding agents can call instead of issuing naive rg or grep commands. The recommended setup installs WarpGrep as an MCP server and lets Claude invoke it through slash commands inside the code harness. Claude WarpGrep demo , Warpgrep docs

The pitch is that retrieval itself has become an inference task: rather than treating search as “just a vector DB query,” you give the LLM a dedicated tool optimized for scanning large repos and log outputs, with its own heuristics and GPU‑accelerated guts. In one shared flow, a developer shows Claude Code calling /warp-grep to locate relevant files and context before planning an implementation, which keeps the main model from burning tokens on broad scans. Claude WarpGrep demo This fits a broader pattern of "specialized subagents" doing one thing well—search, planning, testing—while the main agent orchestrates, and it’s a good option if your current MCP toolchain still relies on slow, generic shell commands for codebase exploration.
Claude Code experiments with shareable sessions for showcasing agent runs
Anthropic is prototyping a "Share session" feature in Claude Code that would let you publish an entire coding session—commands, thoughts, and results—either privately for yourself or publicly via a link, so others can review or continue the work. A leaked modal shows options for Private vs Public sharing, plus a warning when the session touches a private repository. share UI
This would make it much easier to show teammates how Claude solved something (or got stuck) instead of summarizing after the fact, and it’s a natural fit for debugging agent harnesses as well: you could share a broken run with a vendor or open‑source maintainer without having to screen‑record your terminal. The safety warning about private repos suggests Anthropic is at least aware of the risk of accidentally leaking source code via public links, but you’ll still want governance guardrails before anyone is allowed to make sessions public from sensitive monorepos. share UI If it ships, expect to see "vibe coding" transcripts become a standard artifact for code reviews and agent tuning.
Cline partners with LG CNS to build an “AI-native” enterprise dev flow
Agentic code editor Cline announced a partnership with LG CNS to build an "AI-Native Development" solution that covers the full software lifecycle—from requirements to code to testing—using agents as first‑class citizens rather than bolt‑ons. partnership post , partnership blog The framing here is that big enterprises don’t just want an AI pair programmer; they want a harness that can interact with Jira, design docs, monorepos, CI, and testing infrastructure, then hand humans reviewed diffs and reports. Cline’s role is to be that orchestrating layer in the editor, speaking to multiple models and tools while respecting enterprise constraints. partnership post For engineers, this kind of integration matters more than any single model upgrade: if LG CNS starts standardizing on agentic flows inside Cline, you’re likely to see patterns, conventions, and internal tooling emerge that others can copy, in the same way GitHub’s PR model quietly standardized how we ship code.
Yutori Scouts adds suggested query tweaks and privacy controls for web agents
Yutori’s Scouts—browser agents that monitor the web for topics you care about—now surface suggested improvements for your queries and a privacy toggle when you create a new Scout. Following Scouts GA, which focused on moving from preview to general availability, this is an early quality‑of‑life pass for people running agents over their timelines and niche sources. Scouts update

Suggested query tweaks matter because most users think they’ve written clear tracking rules when they actually haven’t; letting the system propose refinements (e.g., narrower terms, excluding noisy sites) should lead to more precise, less spammy alerts. The new privacy setting clarifies whether a Scout is private to you or can be discovered/shared, which is important when agents are effectively codifying your monitoring strategy for markets, competitors, or emergent research topics. Scouts update For AI engineers using Scouts to feed coding agents, this is a gentle reminder that observability and governance start at the data‑collection layer, not only inside the model harness.
🧩 Interoperability: Agent Skills and AG‑UI momentum
Standards traction for cross‑platform skills and UI orchestration. New open Skill catalogs, enterprise rollout controls, and protocol bridges. Excludes model launches; focuses on portability and ops.
Anthropic publishes Agent Skills open standard and enterprise Skills Directory
Anthropic has turned its Claude Skills format into Agent Skills, an open standard for packaging procedures and domain knowledge so they work across different AI platforms, and rolled it out to Claude Team and Enterprise customers with a managed Skills Directory and org‑wide controls. (standard announcement, claude skills blog)

Agent Skills are just folders of markdown instructions, scripts, and resources that agents can discover and load on demand, so you can write a skill once and reuse it in multiple products instead of maintaining separate prompt DSLs. Anthropic is shipping this with:
- A Skills Directory that already includes partner skills from Notion, Figma, Atlassian, Canva and others, plus internal org skills you define yourself. skills directory video
- Org‑wide deployment and policy controls so admins can roll skills out to all users or specific groups from a central console. admin rollout
- Immediate availability on Claude Team and Enterprise plans, not just labs or beta tiers. team enterprise note

The standard itself lives at agentskills.io, with a spec and rationale that make it easy for other vendors to adopt the same on‑disk format instead of inventing yet another prompt packaging scheme. spec overview Anthropic’s own post makes it explicit they expect Skills to “work across AI platforms” and not be Claude‑only, which is why you’re already seeing frameworks and other agents start to pick it up. (claude skills blog, skills blog)
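To make "folders of markdown" concrete, here is a minimal discovery sketch in the spirit of the spec: walk a skills directory, read each skill's SKILL.md, and surface the name/description frontmatter a harness would use to decide when to pull in the full instructions. The SKILL.md‑with‑frontmatter layout follows the published Claude Skills convention; the loader itself is illustrative, not official tooling.

```python
# Minimal, illustrative skill discovery -- not the official Agent Skills tooling.
# Assumes one folder per skill containing a SKILL.md with YAML-style frontmatter
# (name, description) above the markdown instructions, per the Claude Skills convention.
from pathlib import Path

def discover_skills(skills_dir: str = "skills") -> list[dict]:
    skills = []
    for skill_md in Path(skills_dir).glob("*/SKILL.md"):
        text = skill_md.read_text()
        meta = {"path": str(skill_md.parent)}
        if text.startswith("---"):
            # crude frontmatter parse: "key: value" lines between the --- fences
            frontmatter, _, body = text[3:].partition("---")
            for line in frontmatter.strip().splitlines():
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
            meta["instructions"] = body.strip()
        else:
            meta["instructions"] = text
        skills.append(meta)
    return skills

# A harness surfaces just name + description to the model and loads the full
# instructions (plus any scripts/resources in the folder) on demand.
for skill in discover_skills():
    print(skill.get("name"), "-", skill.get("description"))
```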
Open-source frameworks and agents quickly adopt the Agent Skills standard
Within hours of the Agent Skills spec going public, several independent agent frameworks and tools started wiring it in as their preferred way to package procedures and domain knowledge. The pattern is clear: builders want a shared skill format they can point any model or harness at. (opencode plans, openskills launch) Stirrup, a lightweight agent framework, now lets you point it at a directory of skill markdown files and have agents load them dynamically, so the same skills you wrote for Claude Code or Codex can be reused without translation. (stirrup support, stirrup details) OpenSkills, an open project launched the moment Skills went public, is aiming to be a network‑addressable catalog of skills that other agents can query and mount remotely instead of bundling everything locally. (openskills launch, skills workshop) Letta AI is already listed alongside Claude Code and OpenCode as a first‑party Agent Skills supporter, which means its agents can discover and invoke the same procedures you define for Claude. letta skills And on the IDE side, the author of OpenCode (opcode) says they held off on their own skills system once they heard a standard was coming, and have now committed to supporting Agent Skills with a planned ship next week. opencode plans The net effect for engineers is that you can start investing in skill files as durable assets—"knowledge modules" that survive model swaps, harness changes, and even vendor changes—rather than throwing away work every time you change where your agent runs. Omar from Dair AI summed it up: "Skills will start to show up everywhere." skills prediction
Oracle’s Open Agent Spec adopts AG‑UI protocol via CopilotKit
Oracle’s Open Agent Spec has been wired into the AG‑UI protocol through CopilotKit, so agents defined once in Oracle’s JSON spec can now be rendered as full interactive UIs with tracing and observability using the same open client runtime. (oracle ag-ui brief, oracle thread)

CopilotKit describes AG‑UI as the "glue" that turns structured agent definitions into working frontends: JSON in, live UI out, with built‑in logging, inspection, and experiment control. a2ui tutorial The Oracle integration means:
- Agent Spec focuses on portability and standardizing what an agent can do, while AG‑UI handles how users see and steer it, including controls, intermediate state, and tool traces. oracle ag-ui brief
- Teams can go from a declarative agent JSON to a usable web UI in minutes, which makes it much faster to try new multi‑tool or multi‑step agents without writing bespoke frontends each time. a2ui tutorial
- The same AG‑UI client can now talk to Oracle‑defined agents alongside other AG‑UI compatible systems, making it a plausible cross‑vendor UI layer rather than yet another proprietary playground. oracle thread

CopilotKit has shipped a starter project plus docs that show AG‑UI taking care of layout, state, and telemetry while Agent Spec and the backend handle the actual planning and tool calls. starter project blog post For AI teams, the interesting part is that this turns “agent UX”—the knobs, traces, and affordances around your model—into something you can standardize and swap, much like Agent Skills is doing for procedures.
📊 Leaderboards and eval suites: cost, tools, and rankings
Fresh third‑party measurements and open tooling. Gemini 3 Flash climbs new boards; Arena and Toolathlon updates; open‑sourced ranking code. Excludes Codex’s launch metrics (in feature).
Arena Search leaderboard adds GPT‑5.2‑Search and Grok‑4.1‑Fast‑Search
LMArena’s Search leaderboard now features OpenAI’s GPT‑5.2‑Search at #2 with a score of 1211 and xAI’s Grok‑4.1‑Fast‑Search at #4 with 1185, both beating their own prior search‑tuned models by 10–17 Elo points. search arena update If you’re building RAG or answer engines, this is a concrete third‑party signal that both vendors’ search‑specialized variants are worth testing against your current stack rather than relying on generic chat models for retrieval-heavy tasks.
Gemini 3 Flash consolidates its lead across multiple third-party eval suites
Today’s wave of third‑party evals—Toolathlon tool use toolathlon chart, Vals AI’s SWE‑bench/MMMU results vals benchmarks, Epoch’s ECI eci chart and AA price‑vs‑Elo plots elo-vs-price scatter—all converge on the same picture: Gemini 3 Flash is now consistently in the top tier of frontier models when you factor in cost.
For AI leads setting 2026 defaults, it’s time to assume 3 Flash (or whatever Google ships next under that pricing) will be one of the main axes in any serious evaluation plan, not a side experiment, even if you still hedge with Opus, GPT‑5.2‑high, or DeepSeek on specific workloads.
GPT‑5.2 and Olmo‑3.1‑32B‑Think join Arena’s text model rankings
Arena has added OpenAI’s GPT‑5.2 (standard, non‑high) to its Text leaderboard at #17 with an Elo of 1439, two points above GPT‑5.1 and one point behind GPT‑5.2‑high. arena text update At the same time, AI2’s Olmo‑3.1‑32B‑Think debuts as a new open‑weights reasoning entry, giving teams another strong non‑proprietary option to A/B against the closed labs. olmo arena For anyone routing traffic by Arena Elo instead of vendor marketing, this is a nudge to start experimenting with 5.2 as a default conversational model and to test Olmo‑3.1‑32B‑Think where openness or self‑hosting matter.
GSO-Bench: GPT‑5.2‑high + OpenHands leads optimization speedup rankings
On the new GSO benchmark for optimization problems, OpenHands paired with GPT‑5.2‑high achieves a 27.4% OPT@1 score (26.5% hack‑adjusted), topping Claude 4.5 Opus + OpenHands at 26.5% and Gemini 3 Pro + OpenHands at 18.6%. gso leaderboard
If you’re using agents to tune hyperparameters, compile constraints, or solve contest‑style tasks, this suggests a 5.2‑high + OpenHands stack is currently the most reliable at finding true improvements over a human baseline on a first attempt, though the benchmark authors also highlight how sensitive these results are to agent design, not just the base model.
K2‑V2 tops Artificial Analysis openness index with competitive intelligence
MBZUAI’s new K2‑V2 70B reasoning model debuts tied for #1 on Artificial Analysis’s Openness Index—matching OLMo 3 32B Think—by releasing weights, pre‑ and post‑training data, and Apache‑licensed code, while scoring 46 on their Intelligence Index in high‑reasoning mode. k2-v2 summary
The model’s medium reasoning mode cuts token usage ~6× for only a 6‑point intelligence drop, and intriguingly performs better on their hallucination‑sensitive AA‑Omniscience test, making K2‑V2 a strong candidate if you want an open model that’s both capable and thoroughly documented for downstream fine‑tuning. k2-v2 analysis
Scale AI releases Audio MultiChallenge for multi-turn spoken-dialogue evals
Scale’s MAI team has open‑sourced Audio MultiChallenge, a 452‑conversation benchmark across four tasks (Inference Memory, Instruction Retention, Self‑Coherence, Voice Editing) designed to stress‑test speech‑native, multi‑turn dialogue models on real, noisy audio. audio benchmark They report that even top models like Gemini 3 Pro only reach ~55% pass rate, with performance dropping 36.5% when answers depend on audio cues (background sounds, prosody) rather than transcript text, and that text‑output modes still outperform speech‑to‑speech generations—highlighting how immature voice‑first agents remain compared with their text counterparts. benchmark paper dataset page
Vals and AA paint a mixed but clearer eval picture for Gemini 3 Flash
Taken together, Vals AI’s index, Toolathlon, Epoch’s ECI and AA‑Omniscience now show a fairly consistent story: Gemini 3 Flash is frontier‑tier on many reasoning, coding and multimodal tasks vals overview toolathlon chart but has noticeably worse hallucination behavior on knowledge‑heavy QA. hallucination chart If you’ve already adopted 3 Flash as your default fast model, the practical takeaway is to segment its usage: lean on it heavily for tool‑driven coding, search and UI‑rich workflows, but keep slower, more fact‑reliable models in the loop for anything that looks like long‑form factual judgment or compliance work.
Arena open-sources Arena-Rank, the paired-comparison ranking engine behind LMArena
LMArena has open‑sourced Arena‑Rank, the exact Bradley–Terry and contextual Bradley–Terry implementation they use in production to turn head‑to‑head votes into Elo‑style model rankings, including confidence intervals and utilities for messy real‑world datasets. (arena-rank release, github repo) This means you can now run the same math on your own internal pairwise evals—whether for models, prompts, or agents—instead of bolting together ad‑hoc scoring scripts, and you can audit how Arena’s public leaderboards are computed rather than treating them as a black box. arena-rank blog
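If you want intuition for what the library computes before wiring it in, the core Bradley–Terry estimate fits in a few lines. The sketch below runs the classic MM updates on a toy vote matrix and rescales to Elo‑style numbers for readability; it illustrates the model Arena‑Rank implements, not the library's own API, which additionally handles confidence intervals and contextual variants.

```python
# Toy Bradley-Terry fit via the standard MM updates (Hunter, 2004). This is a
# minimal illustration of the model behind Arena-Rank, not the library's API.
import numpy as np

# wins[i, j] = head-to-head votes model i won against model j (made-up numbers)
wins = np.array([
    [0, 30, 45],
    [20, 0, 40],
    [15, 10, 0],
], dtype=float)
games = wins + wins.T            # total comparisons per pair
total_wins = wins.sum(axis=1)

p = np.ones(len(wins))           # latent strengths
for _ in range(200):             # MM iterations
    denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
    p = total_wins / denom
    p /= p.sum()                 # pin down the arbitrary scale

# Elo-style presentation: 400 * log10 of relative strength, anchored near 1000
elo = 1000 + 400 * np.log10(p / p.mean())
print(elo.round(1))
```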
Exa publishes People Search benchmark for role- and location-aware profile retrieval
Exa has released an open People Search benchmark that evaluates how well search APIs retrieve the right LinkedIn‑style profiles by role, location and seniority, using 1,400 synthetic queries across nine job‑function categories. people-search brief
They track recall@1, recall@10 and precision over a de‑identified corpus of ~1B profiles, making this a useful yardstick if you’re building GTM tools, recruiting products, or internal people‑search features and want to compare your retrieval stack—including LLM‑reranked hybrid search—against a standardized, task‑aligned suite. (benchmark blog, github benchmark)
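The reported metrics are standard ranked‑retrieval math, so you can score your own stack the same way before touching Exa's harness. Below is a minimal sketch of recall@k and precision@k over ranked profile IDs; the IDs and ground truth are made up.

```python
# Plain recall@k / precision@k over ranked results -- the standard definitions
# the benchmark reports, not Exa's evaluation harness itself.
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant profiles that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for pid in top_k if pid in relevant_ids) / max(len(top_k), 1)

# One toy query with two matching profiles in the ground truth
ranked = ["p42", "p7", "p99", "p13"]
relevant = {"p7", "p55"}
print(recall_at_k(ranked, relevant, 1))     # 0.0
print(recall_at_k(ranked, relevant, 10))    # 0.5
print(precision_at_k(ranked, relevant, 4))  # 0.25
```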
MLE-Bench-30 shows GPT‑5.2-Codex underperforming 5.1-Codex-Max on ML engineering tasks
Early MLE‑Bench‑30 results suggest GPT‑5.2‑Codex scores only 10% pass@1 on a 30‑task ML engineering suite, versus 17% for GPT‑5.1‑Codex‑Max and 16% for GPT‑5.2‑thinking without browsing. mle-bench chart
For teams tempted to assume the new Codex variant dominates everywhere, this is a cautionary datapoint: at least on this particular benchmark of end‑to‑end ML workflows, the older 5.1‑Codex‑Max still looks stronger, so you may want to A/B both before flipping your default model for infra and data‑science agents.
⚙️ Serving and runtime: MoE scaling, hybrid APIs, embedded media
Runtime engineering notes: MoE throughput, Ollama‑compatible backends, and microcontroller media SDKs. Mostly systems updates; few core model notes today.
vLLM wide-EP MoE hits ~2.2k tok/s per H200 GPU on multi-node runs
vLLM shared new community benchmarks showing its wide expert-parallel (wide-EP) MoE inference stack sustaining ~2.2k tokens/sec per H200 GPU on multi-node clusters, using DeepEP all‑to‑all, dual‑batch overlap, and expert parallel load balancing with KV‑efficient MLA routing. vllm wide ep thread Following up on Blackwell tuning, this extends their earlier single‑GPU speedups to a realistic cross‑node setting by disaggregating prefill/decoding paths and keeping expert imbalance from stalling whole EP groups.

For infra engineers, the point is: you can now treat DeepSeek‑style MoE as a first‑class serving pattern instead of a research toy, but only if you adopt the full recipe (wide‑EP flag, DeepEP all‑to‑all, KV‑cache affinity, and separate prefill/decode deployments) rather than naïve expert parallelism that gets dominated by collectives and stragglers.
SGLang adds Ollama-compatible API to bridge local and cloud inference
LMSYS released an Ollama‑compatible API for SGLang, so you can point the standard ollama CLI or Python client at an SGLang server by changing OLLAMA_HOST, and get high‑performance inference plus smart routing between local and remote models. ollama api thread This builds on their earlier compact server work mini SGLang and effectively lets you prototype with true Ollama on a laptop, then swap the backend to an SGLang cluster for production without rewriting client code. usage guide For teams already invested in Ollama tooling, this turns SGLang into a drop‑in backend: keep your ollama run workflows, but gain parallel decoding, better batching and a cleaner story for mixing cheap local models with bigger cloud models behind the same API.
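Concretely, client code stays the same and only the host changes. A sketch, assuming the SGLang deployment exposes its Ollama‑compatible endpoint at a reachable URL and already has the referenced model loaded (both placeholders):

```python
# Sketch: point the stock Ollama Python client at an SGLang backend by overriding
# the host -- the same effect as exporting OLLAMA_HOST for the ollama CLI.
# The URL and model name are placeholders for your actual SGLang deployment.
from ollama import Client

client = Client(host="http://sglang.internal:11434")

resp = client.chat(
    model="qwen2.5-coder:32b",  # placeholder; use a model the server has loaded
    messages=[{"role": "user", "content": "Summarize what changed in this diff."}],
)
print(resp.message.content)
```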
FunctionGemma lands in Ollama for tiny on-device function-calling agents
Google’s new 270M‑parameter FunctionGemma, a function‑calling‑focused Gemma 3 variant, is now available as functiongemma in Ollama v0.13.5, so you can pull and run it locally with ollama pull functiongemma and bind it to tools on laptops or edge boxes. ollama functiongemma tweet The model is designed for low‑latency, offline agent workflows (smart home, mobile, system actions) and ships with examples for JSON‑schema tools in the Ollama model card. model page

For runtime architects, this means you can keep heavyweight reasoning models in the cloud while delegating fast, private function‑calling (like OS automation or home control) to a tiny local model that runs fully on‑device, with the same Ollama tool API you already use for bigger LLMs.
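As a concrete picture of that split, here is a minimal local sketch using Ollama's standard tool‑calling interface. The set_light tool is invented for illustration; functiongemma is the model tag the release uses.

```python
# Minimal on-device function-calling sketch via Ollama's standard tools API.
# The `set_light` tool schema is made up for illustration; heavier reasoning
# would stay in the cloud while this tiny model only emits the structured call.
import ollama

tools = [{
    "type": "function",
    "function": {
        "name": "set_light",
        "description": "Turn a smart light on or off",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string"},
                "on": {"type": "boolean"},
            },
            "required": ["room", "on"],
        },
    },
}]

resp = ollama.chat(
    model="functiongemma",
    messages=[{"role": "user", "content": "Turn off the kitchen light."}],
    tools=tools,
)

# The local runtime executes whatever structured call the model emits.
for call in resp.message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```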
LiveKit and Espressif ship ESP32 SDK for realtime AV on microcontrollers
LiveKit announced a full SDK for Espressif’s ESP32 microcontrollers, using their EXO 1.0 stack to stream realtime audio and video from very resource‑constrained devices into the LiveKit cloud. esp32 sdk tweet The demo shows wearables, smart cameras and avatar intercoms pushing media straight from an ESP32 into LiveKit rooms, with unified APIs across mobile, web and now microcontrollers. blog post
If you’re building embedded agents or on‑device perception, this gives you an off‑the‑shelf way to turn tiny ESP32 boxes into first‑class participants in your AV pipelines, instead of having to bolt on a Raspberry Pi or phone as a relay.
Warp adds Python and Node environment chips for quicker debug context
Warp terminal added "environment chips" that automatically surface your active Python virtualenv path and current Node.js version, plus a toggle to switch Node versions via nvm, directly in the prompt UI. warp env update The short demo shows Warp detecting a .venv for Python projects and letting you flip Node versions without retyping shell incantations before you run tests.

This is a small but practical runtime quality‑of‑life improvement: agents and humans both get an at‑a‑glance view of which interpreter they’re about to use, cutting a bunch of "why is this using system Python" and "wrong Node version" bugs before they waste cycles.
🏗️ AI infrastructure build‑out and power economics
Concrete moves linking national labs and vendors, plus power stack visibility. Signals on where capacity and efficiency will come from; not a funding roundup.
DOE’s Genesis Mission signs AI lab MOUs to double scientific output
The US Department of Energy’s Genesis Mission is now formally partnering with OpenAI, Google DeepMind, Anthropic, NVIDIA and Groq to pair frontier models with national‑lab supercomputers and target roughly a 2× boost in scientific productivity by 2030. (deepmind genesis summary, openai mou quote, anthropic partnership note, groq mission thread, nvidia genesis blog) This follows OpenAI’s own framing of 2026 as a “Year of Science” and its push to get frontier models closer to real research environments. (openai year-of-science, compute flywheel)

The MOUs set up information‑sharing and follow‑on agreements so labs can run things like fusion plasma simulations, climate models, and materials discovery directly on DOE machines using vendor models and tooling, with OpenAI already testing GPT‑5‑series reasoning on Los Alamos’s Venado system and in a 1,000‑scientist “AI Jam”. openai mou quote openai science blog DeepMind is pitching Gemini 3 for physics and chemistry workloads, while NVIDIA is bringing its full accelerated stack plus open domain models like Apollo and Nemotron into the program, explicitly tying its hardware roadmap to US‑led scientific compute. (deepmind genesis summary, nvidia genesis blog) Groq’s role is to pilot energy‑efficient, deterministic inference for lab‑side agents and simulations, aiming to cut power per unit of reasoning, not just add more GPUs. groq mission thread Anthropic commits Claude plus a dedicated engineering team across energy, biosecurity and basic research so that lab groups get help building actual workflows, not just API keys. anthropic partnership note
For teams building on top of these vendors, the signal is that a lot of near‑term capacity and optimization work will be skewed toward scientific and national‑lab workloads, not just consumer chat. It also means the reference architectures coming out of DOE—how to wire models into HPC codes, safety reviews, and high‑assurance workflows—are likely to filter back into commercial best practice over the next 12–24 months, especially for companies trying to justify big AI capex to regulators and boards.
Epoch: GPUs only ~40% of AI datacenter power once overheads included
Epoch AI broke down where electricity actually goes in a frontier AI datacenter and estimates that GPUs account for only about 40% of total power once you include server overhead, networking fabric, and cooling plus power‑conversion losses. (epoch power thread, power tweet followup) That complements their earlier grid‑level work on how much AI load the US system can absorb by 2030. grid study The group models three big multipliers on top of raw GPU draw: servers use roughly 1.5× more power than the accelerators alone once you add CPUs, fans, storage and local interconnect; non‑server IT like top‑of‑rack and spine switches adds another 1.14×; and then cooling plus power‑conversion and other facility overhead piles a further 1.4× on top. (epoch power thread, power tweet followup, power usage article) In other words, a rack full of GPUs that nominally needs 1 MW at the chip level ends up closer to 2.4 MW at the wall. That ratio matters when you’re planning substations, transformers and long‑lead equipment, not just ordering more accelerators.
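The multipliers compound, which is why the wall number surprises people. A quick back‑of‑envelope with Epoch's figures:

```python
# Back-of-envelope using Epoch's multipliers: chip-level GPU draw -> wall power.
gpu_mw = 1.0                 # nominal accelerator draw for a pod, in MW
server_overhead = 1.5        # CPUs, fans, storage, local interconnect
network_overhead = 1.14      # top-of-rack and spine switches
facility_overhead = 1.4      # cooling, power conversion, other facility load

wall_mw = gpu_mw * server_overhead * network_overhead * facility_overhead
print(f"{wall_mw:.2f} MW at the wall")  # ~2.39 MW, i.e. GPUs are ~40% of the total
```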
For infra engineers and AI leads, the point is: efficiency gains on any one layer (cooling, networking, server design) compound across the stack, while sloppy design in one place can wipe out GPU‑level efficiency wins. It also means power‑aware scheduling and model placement—e.g. co‑locating latency‑sensitive work on more efficient pods—has real room to move the needle on both cost and environmental footprint, even if you can’t touch model architectures directly.
64 GB DDR5 kit spikes from ~$150 to ~$800 in three months
Retail pricing for a popular Crucial 64 GB DDR5 kit has climbed from roughly $150 to nearly $800 in about three months, according to a widely shared price‑history chart. ddr5 price chart
The chart shows the kit hovering in the low‑$200 range through September before a steep, almost vertical run‑up into the high‑$700s by mid‑December, with the current new price around $640 and recent peaks near $798. ddr5 price chart There’s no hard attribution baked into the data, but the timing lines up with mounting AI workstation builds, rapidly densifying inference servers, and OEMs hoarding high‑capacity DIMMs for Blackwell and similar GPU nodes rather than letting them hit consumer channels. For individual engineers, the impact is direct: upgrading local dev rigs to 64–128 GB is suddenly a lot more expensive, and for smaller shops trying to stretch on‑prem clusters, memory—not GPUs—may become the next visible bottleneck.
For infra and procurement teams, this is one more signal that AI demand doesn’t just stress GPU supply; it also distorts adjacent components like DRAM and power gear. Locking in memory contracts or designing models and serving stacks that can tolerate lower per‑node RAM (through sharding or more aggressive KV eviction) may be as important as chasing the latest accelerator SKU over the next year.
📥 Data plumbing: OCR, parsing tiers and robust JSON
Document intelligence and data extraction saw multiple concrete upgrades; also pragmatic guardrails for structured outputs. Useful for RAG and ETL pipelines.
Mistral OCR 3 pushes SoTA document understanding at $1–2 per 1k pages
Mistral released OCR 3, a new document-understanding model that clearly beats major enterprise OCR systems from AWS, Azure, Google DocAI and DeepSeek on forms, invoices, handwriting, complex tables and historical scans, with scores like 95.9% on forms and 88.9% on handwriting versus competitors in the low‑70s or 80s benchmark recap. It’s exposed as mistral-ocr-2512 via the API and a Document AI playground in Mistral Studio, priced at $2 per 1,000 pages (or $1 per 1,000 via batch mode), and returns both clean text and rich structure such as markdown with HTML‑style table reconstruction including row/col spans, which is exactly what RAG and ETL pipelines need to keep schema alignment on messy real‑world PDFs studio launch pricing summary.
For AI engineers, this means you can probably rip out a lot of brittle custom parsing around invoices, forms, handwritten notes and low‑quality scans, and let OCR 3 handle both recognition and layout so your downstream indexers just consume markdown/tables rather than PDFs or ad‑hoc JSON capability overview Mistral blog post.
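For orientation, a call through the Mistral Python SDK's OCR endpoint looks roughly like the sketch below. The model id is the one the launch cites; exact response field names can differ across SDK versions, so treat this as a shape rather than a contract.

```python
# Sketch of a Document AI call via the Mistral Python SDK's OCR endpoint.
# The model id comes from the launch; verify field names against your SDK version.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

result = client.ocr.process(
    model="mistral-ocr-2512",
    document={"type": "document_url", "document_url": "https://example.com/invoice.pdf"},
)

# Pages come back as markdown with table structure preserved, ready for a RAG
# indexer or ETL job instead of a bespoke PDF parser.
for page in result.pages:
    print(page.markdown)
```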
Firecrawl launches /agent for autonomous web navigation and dataset extraction
Firecrawl unveiled /agent, a research‑preview API that takes a natural‑language description (optionally with a starting URL) and then autonomously searches, clicks, paginates and scrapes the web to return structured data, rather than raw HTML agent launch. In their examples, the agent accomplishes in minutes what would normally be hours or days of manual browsing and copy‑pasting, targeting use cases like building large domain‑specific datasets, multi‑site aggregation, or deep research for downstream RAG capability summary agent docs.

For AI engineers, this is effectively a programmable "web ETL" layer: instead of writing one‑off scrapers per site, you describe the schema or fields you want and let the agent handle navigation and anti‑scraping quirks, then feed the resulting JSON or rows straight into your index, warehouse, or labeling pipeline scale example.
OpenRouter ships Response Healing to auto‑repair malformed JSON from LLMs
OpenRouter introduced Response Healing, a plug‑in that automatically fixes malformed JSON from any upstream model—handling things like trailing commas, missing brackets and bad escaping—before the response hits your app, based on observations from over 5M structured requests last week feature launch. Their benchmarks show defect rates dropping by 80%+ across a mix of OpenAI, Google, Anthropic and open models, with sub‑millisecond latency overhead and zero changes required to your prompts or schema tooling overhead note defect stats.
Today it only repairs syntax, not schema validity, but for any production system that depends on response_format=json or tool‑calling outputs this is an immediate reliability win—you toggle it on in OpenRouter settings and start burning fewer retries, writing fewer regex band‑aids, and filing fewer “LLM forgot a quote” bug tickets healing blog.
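Because healing is an account‑level toggle rather than a request parameter, an ordinary structured‑output call like the sketch below (placeholder model and schema) doesn't change at all; it simply stops failing on stray commas and unclosed brackets once the plug‑in is enabled.

```python
# Standard OpenRouter structured-output request; model id and schema are
# placeholders. Response Healing is switched on in account settings, so the
# request body is unchanged -- malformed JSON gets repaired before it reaches you.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "openai/gpt-4.1-mini",
        "messages": [{"role": "user", "content": "Extract vendor and total from: ACME Corp, $1,234.50"}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "invoice",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"vendor": {"type": "string"}, "total": {"type": "number"}},
                    "required": ["vendor", "total"],
                    "additionalProperties": False,
                },
            },
        },
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```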
NotebookLM adds Data Table artefact and one‑click export to Sheets
Google’s NotebookLM quietly added a Data Table artefact type that lets the model turn unstructured notes and reports into structured tables, which you can then export directly into Google Sheets; notes and longform reports can also now be exported into Docs and Sheets for further processing feature screenshot. This matters if you use NotebookLM as a research scratchpad: instead of manually rebuilding tables from prose summaries, you can have the model synthesize a proper tabular view once, ship it to Sheets, and plug it into your existing analytics or ETL flows upgrade tease.
It’s not a replacement for heavy‑duty OCR or schema‑aware parsers, but it does give product teams a fast path from raw source material inside NotebookLM to CSV‑like structure that other tools can consume without extra glue code export reminder.
💼 Capital flows and enterprise platform moves
Material financing rumors and enterprise distribution updates. Mostly OpenAI financing chatter and app platform news; includes a notable Series B.
OpenAI explores tens of billions in new funding at ~$750B valuation
OpenAI is reportedly in early talks to raise tens of billions of dollars at a valuation around $750B, with internal figures citing roughly $19B annualized revenue as of late 2025. valuation summary Following up on compute flywheel, where they tied multi‑GW data center buildout to revenue growth, this cements OpenAI as one of the most valuable private companies on earth and signals how aggressively they plan to finance future compute and model training.
For AI leaders, this implies OpenAI expects continued demand for frontier models at a scale that justifies both the capital and the power buildout, and it gives a rough revenue benchmark for what "winning" looks like in consumer and enterprise AI right now. Analysts should read this as confirmation that the capital flywheel—money → compute → better models → more money—is still spinning, with valuations now being justified on the back of real revenue rather than only expectations. funding analysis
Amazon in talks to invest ≥$10B in OpenAI and supply Trainium chips
Bloomberg and others report that Amazon is in initial talks to invest at least $10B in OpenAI, in a deal that would also see OpenAI adopt Amazon’s Trainium accelerators for some workloads. amazon talks This would build on the earlier $38B cloud agreement and, if completed, value OpenAI north of $500B, effectively deepening a compute‑for‑equity style partnership similar to Microsoft’s and Google’s arrangements with other labs, and extending the story from Amazon funding.
For infra and strategy teams, the takeaway is that Amazon seems determined to make Trainium a first‑class alternative to NVIDIA in large‑scale inference and training, and OpenAI is happy to multi‑home if it reduces GPU dependence. deal rumor If this closes, expect Trainium support to show up on OpenAI’s internal stack, which matters if you’re betting on specific hardware ecosystems for on‑prem or hybrid deployments that need to stay compatible with leading models.
Lovable raises $330M Series B at $6.6B to scale no‑code AI app platform
Lovable announced a $330M Series B at a $6.6B valuation, led by CapitalG and Menlo’s Anthology fund, with strategic checks from NVentures (NVIDIA), Salesforce Ventures, Databricks, Deutsche Telekom, Atlassian, HubSpot and others. series b thread The company positions itself as an AI‑native builder tool for "the 99%"—non‑coders inside startups and large enterprises—so this war chest is aimed squarely at turning their AI app‑builder into a mainstream platform.

For AI teams, this is a signal that investors see durable value not just in frontier models but in verticalized "AI OS" layers where non‑technical staff assemble workflows, internal tools, and small apps without writing code. The strategic investor mix (NVIDIA, Databricks, Atlassian, HubSpot) suggests Lovable will keep leaning into deep integrations with existing SaaS, making it easier for enterprises to standardize on a single AI app‑building surface instead of a sprawl of bespoke agents and internal tools. (investor breakdown, founder reflection)
Amazon restructures AI into unified AGI division under Peter DeSantis
Amazon is restructuring its AI efforts into a single AGI division that combines frontier model research, custom silicon (Trainium) and quantum computing under longtime AWS leader Peter DeSantis, while current AGI chief Rohit Prasad departs. amazon agi reorg Robotics veteran Pieter Abbeel will oversee the frontier model team, and the group reports directly to CEO Andy Jassy.
For AI engineers and infra planners, this means Amazon is now set up more like a focused AI lab inside a cloud giant: model research, chips, and infra in one org. That should reduce the friction between model needs and silicon roadmaps, and it raises the odds that Trainium and future ASICs are tightly co‑designed with the models they’ll run. It also signals Amazon intends to compete head‑on with OpenAI/Microsoft and Google/DeepMind not just as a cloud for AI, but as a first‑party model and agent provider.
OpenAI sells 700k+ ChatGPT licenses across ~35 US public universities
Documents cited by Techmeme show OpenAI has sold 700,000+ ChatGPT licenses to roughly 35 US public universities, with students and faculty using it more than 14M times so far. university licenses These are paid institutional seats, not just casual free usage, so this is one of the first concrete adoption snapshots for AI assistants in higher education.
For AI product people, this highlights universities as a serious early enterprise vertical: they are willing to buy at scale for students if you can bundle access, governance, and usage reporting. For leaders, it’s also a reminder that competing education‑focused tools need to be positioned either as safer, more controllable layers on top of ChatGPT, or as domain‑specific assistants that can justify their own license line item against OpenAI’s already‑embedded seat counts.
Morningstar and PitchBook ship financial data apps into ChatGPT
Morningstar and PitchBook launched official apps inside ChatGPT, letting users query equity and private‑market data using natural language rather than flipping between terminals. morningstar pitchbook The deal plugs two of the most widely used investing datasets straight into OpenAI’s new Apps framework, so any ChatGPT user on supported plans can pull fundamentals, valuations, and deal data inline while they chat.
This follows Apps directory, where OpenAI opened the ChatGPT Apps directory to third parties, and shows the kind of "serious" integrations that can live there beyond consumer search and creativity. For fintech and analytics teams, it’s a preview of how vertical data vendors will defend their moats: by becoming first‑class ChatGPT apps, not just external dashboards, and by letting AI agents orchestrate their APIs under the hood.
Perplexity launches desktop‑grade iPad app for AI research and browsing
Perplexity released a new iPad app that brings the full desktop feature set—Labs, Deep Research, Finance, Spaces and more—to iPadOS, optimized for split‑screen and Stage Manager workflows. perplexity ipad app The app is pitched as a serious productivity surface, not just a mobile wrapper, so power users can keep long research sessions and structured outputs on tablet hardware.

For AI product teams, this is another data point that "AI browsers" are trying to live everywhere people work, not just on laptops. If your users already sit in an iPad‑heavy environment (field work, education, healthcare), Perplexity just became a more credible default assistant there, and you may want to think about whether your own agents integrate with, compete with, or piggyback on that usage pattern.
Sora video app rolls out across 10 Latin American countries
OpenAI’s Sora video app is expanding beyond its original markets into 10 Latin American countries, including Mexico, Argentina, Chile, Colombia and Peru. sora launch tweet The official announcement frames this as the start of a broader international rollout, inviting local creators to start using Sora for short‑form video.
For platform strategists, this matters because Sora now competes more directly with regional creator tools and with model‑backed video products from Meta and Google in fast‑growing markets. If you operate in media, education or marketing in LATAM, you should expect an uptick in Sora‑generated content and consider how to accept, flag, or verify these videos inside your own products.
🎙️ Realtime voice stacks: accuracy, controls and hosting
Voice‑to‑voice APIs and production hosting updates. Pricing/latency and enterprise posture in focus; fewer creative audio items today.
Grok Voice Agent tops speech reasoning benchmark and posts strong latency profile
xAI’s Grok Voice Agent now has detailed public evals from Artificial Analysis, confirming it as the top speech‑to‑speech model on the Big Bench Audio reasoning suite at 92.3% accuracy while staying among the fastest models to first audio output. Following up on Grok launch, which introduced the flat $0.05/min API, the new charts show Grok ahead of Gemini 2.5 Flash Native Audio (92.1%) and Nova Sonic (87.1%) on reasoning, with ~0.78s time‑to‑first‑audio—just behind the fastest Gemini 2.5 Flash variants and ahead of GPT Realtime tiers. benchmarks thread
For infra teams, the more interesting angle is pricing and operating envelope: Grok charges $3/hour for both input and output audio, landing near the middle of the pack on raw cost but at the very top of the quality charts. benchmarks thread It also supports tool calling and SIP telephony integrations (e.g. Twilio, Vonage) out of the box, which makes it viable as a drop‑in backend for phone agents and call centers where reasoning accuracy, not only speed, determines business value. capability overview For 2026 voice stacks, this positions Grok as a premium‑quality but still straightforwardly priced option when you care about hard questions, not small‑talk bots.
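The pricing math is simple enough to sanity-check before committing. A minimal sketch, assuming the quoted flat $3/hour rate applies to total audio minutes (input plus output billed at the same per-minute price); the traffic figures are made up purely for illustration:

```python
# Back-of-the-envelope cost model for a voice-agent backend. Assumes the flat
# $3/hour rate covers combined input and output audio time; numbers below are
# illustrative, not a quote.

RATE_PER_MINUTE = 3.0 / 60  # $3/hour == $0.05/minute

def monthly_audio_cost(calls_per_day: int, avg_call_minutes: float, days: int = 30) -> float:
    """Estimated monthly audio spend for a phone-agent deployment."""
    return calls_per_day * avg_call_minutes * days * RATE_PER_MINUTE

# Example: 2,000 calls/day at ~4 minutes each is roughly $12,000/month in audio.
print(f"${monthly_audio_cost(2000, 4):,.0f} per month")
```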
Scale AI releases Audio MultiChallenge benchmark for multi‑turn voice reasoning
Scale AI’s MAI team open‑sourced Audio MultiChallenge, a new benchmark for evaluating spoken conversational intelligence in realistic, noisy, multi‑turn settings. benchmark overview It extends their earlier text‑only MultiChallenge suite into audio, covering 452 multi‑turn conversations from 47 speakers across four task families: inference memory, instruction retention, self‑coherence, and a new Voice Editing axis that checks whether models can handle in‑utterance repairs like “two… no, make it six apples.”
Early results show that even frontier models top out at around a 55% pass rate, with a sharp drop‑off when tasks require recalling background audio cues or paralinguistic signals instead of just transcript text—Scale reports a 36.5% accuracy hit on these audio‑cue tasks. benchmark overview For anyone building speech‑first agents, this provides a much more realistic yardstick than casual chat demos and will likely become a go‑to regression suite for testing new end‑to‑end voice stacks and realtime routing policies. The team also released the paper, dataset, and leaderboard so infra groups can plug their own systems into the same evaluation harness. benchmark paper dataset repo
Together AI hosts Rime Arcana v2 and Mist v2 for production TTS
Together AI is now hosting Rime Labs’ Arcana v2 and Mist v2 text‑to‑speech models, giving teams a production‑grade, centrally managed voice layer with deterministic control and compliance guarantees. hosting announcement Arcana v2 focuses on expressive, human‑like voices trained on over a billion real conversations, while Mist v2 adds deterministic pronunciation controls so you can define how a word should sound once and have that rendering reused consistently across 40+ voices—critical for names, brands, and domain jargon. model capabilities
Arcana and Mist are already powering tens of millions of calls per month in Rime’s own stack, and Together brings them into a cloud that’s SOC 2, HIPAA‑ready, and PCI compliant, which matters if you’re putting these behind healthcare or fintech contact flows. hosting announcement For voice‑agent builders already using Together for LLMs or STT, this reduces latency and cross‑cloud complexity by co‑locating all three layers—LLM, speech recognition, and speech synthesis—on one provider, while still letting you swap in open‑weight or other commercial models where needed via the same account. Together blog post
ElevenLabs adds Versioning control plane for voice agents
ElevenLabs introduced Versioning for its Agents platform, a control layer that tracks every configuration change, supports diff views, and lets teams stage rollouts of new agent versions. versioning launch Each agent now carries a version history with editor‑level attribution and timestamps, so you can see exactly who changed prompts, tools, or routing rules and when.

The system also adds traffic controls so you can canary new versions—routing only a slice of calls to a fresh configuration before promoting it to 100% if metrics hold up. versioning launch That’s especially important in voice, where a bad update can instantly affect thousands of live phone conversations. Compliance teams get reproducible records of which version handled which calls, simplifying audits and incident investigations. history modal For anyone running multi‑channel voice agents on ElevenLabs (web, phone, WhatsApp), Versioning turns previously ad‑hoc prompt and tool tweaks into something closer to a proper release pipeline.
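For teams that want the same canary discipline in stacks without a built‑in control plane, the routing half is easy to approximate. A minimal sketch, not tied to any ElevenLabs API, using deterministic hashing so a given call always lands on the same agent version; version names and the 5% slice are illustrative:

```python
import hashlib

def pick_version(call_id: str, canary_fraction: float,
                 stable: str = "agent-v12", canary: str = "agent-v13") -> str:
    """Deterministically bucket a call so retries and follow-ups hit the same version."""
    bucket = int(hashlib.sha256(call_id.encode()).hexdigest(), 16) % 10_000
    return canary if bucket < canary_fraction * 10_000 else stable

# Route ~5% of live calls to the candidate configuration; promote to 100% only
# if latency, containment, and escalation metrics hold up on the canary slice.
for call_id in ["call-0001", "call-0002", "call-0003", "call-0004"]:
    print(call_id, "->", pick_version(call_id, canary_fraction=0.05))
```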
LiveKit and Espressif ship ESP32 SDK for tiny realtime AI voice endpoints
LiveKit partnered with Espressif to ship a full LiveKit SDK for ESP32 microcontrollers, bringing realtime media and AI‑ready connectivity to very small, low‑power devices. sdk announcement The SDK is optimized for streaming audio and video from wearables, intercoms, and smart cameras into LiveKit rooms where LLMs, STT, and TTS can run on servers, while the devices handle only capture, playback, and minimal logic.
From a voice‑stack perspective, this is a way to push the “edge” of your system all the way down to $5–$10 hardware without rewriting transport or NAT traversal logic. You get the same signaling, SFU, and session semantics as web and mobile LiveKit clients, which means you can test and host one set of AI voice agents while swapping front‑ends between browser, phone, and embedded endpoints. For teams designing hardware companions or kiosk‑style assistants, this removes a lot of the bespoke WebRTC and TLS plumbing that usually slows down experiments.
🎬 Creative media pipelines and layered edits
A dense stream of practical gen‑media releases and comparisons—video modification, cinema‑style workflows, layered image decomposition, and community A/Bs.
Qwen-Image-Layered auto-splits images into editable RGBA layers
Alibaba’s Qwen-Image-Layered work tackles one of the biggest pain points in AI art workflows: getting clean layers. The model decomposes a flat image into multiple semantically disentangled RGBA layers—background, subject, logos, text blocks, decorative elements—so you can recolor, resize, remove or move each part independently without re‑rendering the whole scene. paper highlight
The paper demo shows a busy promo poster being split into background burst, character, title text, callouts and logo, then edited by recoloring the backdrop, swapping the shirt color, deleting lemons, resizing a badge, and repositioning labels without harming other content. ArXiv paper For production design and marketing teams, that’s the holy grail: take a single master render, then localize it for markets and formats like you would a Photoshop PSD. The obvious next step is wiring this into tools like Figma, Photoshop, or Comfy graphs so you can treat model outputs as true layered assets instead of dead-end bitmaps.
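To make the "layered asset" point concrete, here is a minimal Pillow sketch of the downstream workflow, assuming the model hands back full‑canvas RGBA layers saved as PNGs (the file names are hypothetical): edit one layer, drop another, and recomposite without touching the rest.

```python
from PIL import Image, ImageEnhance

# Hypothetical per-layer outputs, background first, all full-canvas RGBA PNGs.
layer_paths = ["bg.png", "character.png", "title_text.png", "logo.png"]
layers = [Image.open(p).convert("RGBA") for p in layer_paths]

# Edit one layer without touching the others: dim the background's RGB channels.
r, g, b, a = layers[0].split()
dimmed = ImageEnhance.Brightness(Image.merge("RGB", (r, g, b))).enhance(0.7)
layers[0] = Image.merge("RGBA", (*dimmed.split(), a))

# Drop a layer entirely, e.g. remove the logo for one localized variant.
layers = [layer for i, layer in enumerate(layers) if i != 3]

# Recomposite back into a flat deliverable.
canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
for layer in layers:
    canvas = Image.alpha_composite(canvas, layer)
canvas.save("poster_localized.png")
```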
Bria Video Eraser lands on fal for object and person removal
Bria’s Video Eraser is now live on fal, offering a practical “content‑aware fill for video” that can remove objects or people while preserving background realism and temporal consistency. launch summary You can drive it three ways: natural‑language prompts (“remove the red car”), manual masks, or sparse keypoints for fine control. prompt eraser

For creative shops, this slots neatly into existing pipelines as the clean‑up stage: erase boom mics, bystanders, license plates, or logos without rerunning a full generation model or hand‑painting every frame. Because it’s on fal, you also get REST + SDK access and can script batch passes across a library. The main thing to test is how well it handles hard shots—motion blur, fast pans, or occluded limbs—before you trust it in a fully automated loop.
GPT-Image-1.5 plus Kling 2.5 emerges as a character-consistent video stack
Creators are converging on a two‑stage pipeline where GPT‑Image‑1.5 handles stills and Kling 2.5 handles motion: use GPT‑Image‑1.5 in ImagineArt to generate consistent keyframes of a character or scene, then feed those stills into Kling 2.5 for smooth animated shots, following up on earlier reports of sharper, identity‑preserving edits with OpenAI’s new image model. (edit gains, workflow thread)

ImagineArt wired GPT‑Image‑1.5 in on day one and is even offering a week of unlimited usage, which lowers the barrier to experimenting with multi‑shot sequences and camera moves. free week note What you get in practice is total character consistency plus cinematic transitions: the still model locks in faces, outfits and props, while Kling handles timing, interpolation and scene motion. If you’re building creative tooling, this is a strong pattern to copy: specialize one model for images, another for motion, and design your UX around stitching them rather than waiting for a single monolithic video model to do everything.
Higgsfield Cinema Studio brings camera, lens and move presets to browser video
Higgsfield’s new Cinema Studio product went live, packaging six pro camera bodies, eleven lenses, 15+ director-style camera moves, and 4K output into a browser UI aimed at film‑style AI shots. launch thread It’s designed so you describe the scene, then pick cameras, glass and movements like you would on a set, rather than fiddling with raw CFG knobs. workflow breakdown The upshot is that you can build repeatable cinematography pipelines: grid out multiple angles of the same scene, lock in a virtual kit (say, ARRI body + 35mm lens + dolly in), then re‑use that template across campaigns. Creators are already using it to rapidly iterate looks, then handing chosen stills off to animation tools. creator reaction For teams who’ve been hacking together storyboard → stills → video flows, Cinema Studio gives you something closer to a proper shot list and lens package sitting on top of the model.
LMArena spins up community A/Bs of GPT-Image-1.5 vs Nano Banana Pro
Arena turned the ongoing GPT‑Image‑1.5 vs Nano Banana Pro debate into a structured, community‑driven comparison lab, following up on earlier one‑off tests of the two image models. vs Nano Banana Their image arena now serves paired outputs for the same prompt—everything from “hand with 7 fingers + clock at 8:22 + full glass of wine” to Feynman polaroids, Einstein selfies and detailed GDP slides—and lets people vote on which side wins. arena intro
Because every prompt is fixed and the UI hides which model is which, this gives a more honest signal than cherry‑picked Twitter grids. Over time, those votes feed into Arena’s image leaderboard, so you can see if, say, Nano Banana really dominates on infographics and text rendering while GPT‑Image‑1.5 wins on multi‑character realism, or whether that impression holds up. For teams choosing a default image backend, this is an easy place to send your own prompts, inspect artifacts (hands, text, small objects), and decide when to route different asset types to different models instead of picking a single winner. prompt catalog
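If you want to run the same exercise internally, the mechanics are small: collect blind pairwise votes on fixed prompts, then aggregate them into a ranking. A minimal sketch using an online Elo‑style update (production leaderboards typically fit a Bradley–Terry model over all votes instead; the model names and votes here are placeholders):

```python
from collections import defaultdict

K = 32  # update step size per vote
ratings = defaultdict(lambda: 1000.0)

def record_vote(winner: str, loser: str) -> None:
    """Standard Elo update from a single blind pairwise preference."""
    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1.0 - expected)
    ratings[loser] -= K * (1.0 - expected)

votes = [
    ("nano-banana-pro", "gpt-image-1.5"),   # e.g. an infographic prompt
    ("gpt-image-1.5", "nano-banana-pro"),   # e.g. a multi-character scene
    ("nano-banana-pro", "gpt-image-1.5"),
]
for winner, loser in votes:
    record_vote(winner, loser)
print({model: round(score, 1) for model, score in ratings.items()})
```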
Luma launches Ray 3 Modify for character swaps in existing videos
Luma quietly shipped Ray 3 Modify, a video model that edits existing clips by swapping in new characters from reference images, rather than generating footage from scratch. This is pitched as "inject yourself everywhere now" and is currently available on Luma’s paid plans. launch note

For builders, this is a practical video-mod pipeline: you keep camera motion, lighting, and timing from the source video, and only change who appears in-frame. That’s ideal for UGC remixes, ad localization, and character testing without rebuilding shots. The main questions to probe now are temporal consistency (does the identity stay stable across frames?), edge cases like occlusions, and how well it handles multi-person scenes versus solo subjects. If you’re already using Ray 3 for generation, this is the obvious second tool in that stack for post-hoc edits instead of re-renders.
ComfyUI adds Template Library of real-world workflows and open-source graphs
ComfyUI introduced a new Template Library: a curated set of graph workflows aimed at concrete creative tasks (stylized portraits, product shots, multi‑step video chains), rather than bare model demos. template overview The templates are fully open‑source and can be downloaded to run in local ComfyUI installs, following up on Comfy’s earlier Manager UI work to tame big graphs. manager ui

Each template encodes a full pipeline—models, samplers, control nets, post‑processing—so instead of wiring from scratch, you start from a proven node graph and tweak. A follow‑up note confirmed that all templates live in a public repo, and that the team plans clearer tags for which ones depend on custom nodes, to avoid environment conflicts for local users. local usage note If you’re building your own internal Comfy graphs, this is a useful pattern: share them as tasks (“stylize product shot”, “erase background and relight”) rather than as abstract model playgrounds so non‑experts can actually adopt them.
Gamma bakes Nano Banana Pro into its deck builder with sharp in-slide text
Presentation tool Gamma is now leaning on Nano Banana Pro for all its in‑slide imagery, and has made NB Pro free to use regardless of tier. gamma overview The pitch is simple: NB Pro finally renders logos, UI mockups and tiny product labels cleanly enough that you can trust AI images inside slides, not just as mood boards.
In practice, Gamma’s flow looks like: generate the outline and layout in their AI deck builder, then call Nano Banana Pro for the visuals that sit alongside text, charts and code samples. usage note If you ship any product where images and dense text share the same canvas—pitch decks, spec docs, marketing one‑pagers—this is a good reference pattern: pair a layout engine with an image model whose typography and small glyphs are actually legible.
Runway Gen-4.5 video model appears inside Adobe Firefly Boards
Adobe’s Firefly Boards picked up support for Runway’s Gen‑4.5 video model, letting Firefly users call Gen‑4.5 from within Adobe’s own storyboard‑style interface. integration note The short clip shows Firefly Boards invoking Gen‑4.5 to produce high‑fidelity video snippets directly in a board, instead of exporting prompts to a separate Runway UI.

For teams already entrenched in Adobe’s ecosystem, this matters more than raw model news: it means you can keep your planning, asset management and review inside Firefly, but still tap into a state‑of‑the‑art third‑party generator for final motion. It also hints at a future where creative tools become model routers, swapping in different backends per shot while preserving one UX, rather than forcing editors to pick sides between Gen‑4.5, Veo, Sora or Wan.
Seedance Pro 1.5 claims new SOTA lip-sync for character video
Seedance Pro 1.5 is being highlighted with the claim that “the AI lip‑sync problem is officially solved,” positioning the model as the new state of the art for matching mouth movements, character expressions and audio in video. model announcement The demo shows native lip shapes, emotions and head motion staying in tight sync with the spoken track, even under camera motion.
For creative media pipelines, this kind of model slots in after image or video generation: you generate a talking head or character animation with rough mouth movement, then run Seedance Pro 1.5 to correct phonemes and timing. That reduces how often you need to regenerate entire clips just to fix a few bad syllables, and opens up workflows like localizing dialogue across languages while keeping a consistent performance.
📑 Methods and agents: memory, long‑context and hybrid decoding
A strong wave of new papers: constant‑memory agents, long‑context test‑time tuning, draft‑then‑verify decoding, video RL, and AR→diffusion adaptation. Practitioner‑leaning summaries.
MEM1 trains constant‑memory long‑horizon agents with 3.5× better performance
MEM1 is a reinforcement‑learning framework that teaches agents to keep a single compact internal state across arbitrarily long tasks, instead of appending every observation into context. On a 16‑objective multi‑hop QA benchmark, a 7B MEM1 agent delivers 3.5× higher task success while using 3.7× less peak memory than Qwen2.5‑14B‑Instruct, and on WebShop navigation it beats AgentLM‑13B with 2.8× lower token usage. paper thread
Instead of external memory modules or growing prompts, MEM1 uses PPO with masked trajectories: after each turn the model updates its internal state and discards all prior states, observations, and actions, so inference cost stays effectively constant even as horizon length increases. paper thread For builders, this is a concrete recipe for turning today’s LLM agents into constant‑memory systems on long‑running tasks without changing the base model architecture; unbounded context growth is exactly the failure mode many current agent stacks hit first.
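The control flow is easy to mimic even without MEM1’s RL training. A minimal sketch of the constant‑context loop, where `llm` and `env` are placeholders for your own model call and environment; the key property is that each prompt contains only the current compact state plus the newest observation, never the full transcript:

```python
def llm(prompt: str) -> str:
    """Placeholder for your chat/completions call."""
    raise NotImplementedError

def run_episode(task: str, env, max_turns: int = 50) -> str:
    state = f"Task: {task}. No progress yet."
    observation = env.reset()
    for _ in range(max_turns):
        prompt = (
            f"Internal state:\n{state}\n\n"
            f"Latest observation:\n{observation}\n\n"
            "Rewrite the internal state so it keeps only what matters for the task, "
            "then end with either `ACTION: <tool call>` or `FINISH: <final answer>`."
        )
        reply = llm(prompt)
        state = reply  # the reply becomes the new compact state; prior turns are discarded
        if "FINISH:" in reply:
            return reply.split("FINISH:", 1)[1].strip()
        if "ACTION:" in reply:
            observation = env.step(reply.split("ACTION:", 1)[1].strip())
    return state
```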
Meta’s qTTT test‑time training boosts long‑context Qwen3‑4B by 12–14 points
Meta and collaborators propose query‑only test‑time training (qTTT) for long‑context models, arguing that “just putting more in context” causes score dilution as attention spreads over many similar tokens. paper overview Instead of fine‑tuning the whole model, they cache keys/values once and update only the query weights at test time so the model learns to lock onto the truly relevant parts of a huge prompt.
On LongBench‑v2 and ZeroScrolls, qTTT lifts Qwen3‑4B by +12.6 and +14.1 points respectively, while leaving the base weights unchanged. paper overview For anyone fighting degraded accuracy at 100k+ tokens, this points to a new knob: spend a little extra compute adapting the query to the current document, instead of spending a lot more compute on longer prompts or full fine‑tunes.
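As a rough illustration of the “adapt only the queries” idea (not Meta’s implementation), the sketch below freezes everything except modules named `q_proj` in a Hugging Face causal LM and takes a few self‑supervised language‑modeling steps on the document before answering; the paper’s KV‑caching trick and exact objective are omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"  # any causal LM whose attention uses `q_proj` modules
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Freeze everything except the query projections.
for name, param in model.named_parameters():
    param.requires_grad = "q_proj" in name

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)

def adapt_to_document(document: str, steps: int = 4, max_len: int = 4096) -> None:
    """A few self-supervised LM steps on the document itself before querying it."""
    batch = tok(document, return_tensors="pt", truncation=True, max_length=max_len)
    model.train()
    for _ in range(steps):
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()
```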
CANOE uses synthetic QA + RL to cut RAG hallucinations and beat GPT‑4o
CANOE reframes contextual faithfulness as a synthetic RL problem: it auto‑generates millions of short‑form QA examples from Wikidata and then trains a model to stick to provided context using a Dual‑GRPO objective. method summary The result is a 7B model that outperforms GPT‑4o and o1 across 11 faithfulness benchmarks, without a single human‑labeled example.
By supervising both short answers and longer rationales, CANOE learns to quote and reason from sources instead of free‑associating, and its rewards explicitly penalize unsupported claims. method summary For people building RAG systems, this is a concrete template: generate cheap synthetic QA from your own KB, then run RL on top to make your model “allergic” to going off‑script.
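The reward shape is the transferable part. A toy, string‑level sketch of “reward correct answers, penalize unsupported content” (the real Dual‑GRPO objective supervises both short answers and rationales with learned signals; this only makes the incentive concrete):

```python
def faithfulness_reward(answer: str, context: str, gold: str) -> float:
    """Half the reward for being right, half for staying grounded in the context."""
    correct = 1.0 if gold.lower() in answer.lower() else 0.0
    content_words = [w for w in answer.lower().split() if len(w) > 3]
    unsupported = [w for w in content_words if w not in context.lower()]
    grounding = 1.0 - len(unsupported) / max(len(content_words), 1)
    return 0.5 * correct + 0.5 * grounding

ctx = "Marie Curie won the Nobel Prize in Physics in 1903 and in Chemistry in 1911."
print(faithfulness_reward("She won the chemistry Nobel in 1911.", ctx, gold="1911"))    # 1.0
print(faithfulness_reward("She won it for discovering penicillin.", ctx, gold="1911"))  # 0.0
```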
DEER proposes draft‑with‑diffusion, verify‑with‑AR decoding for language models
DEER (Draft with Diffusion, Verify with Autoregressive Models) introduces a hybrid decoding scheme where a diffusion‑style model proposes candidate text blocks and a conventional autoregressive LM scores and filters them. teaser clip That way you get diffusion’s ability to refine whole chunks at once, while using the AR model as a familiar “truth oracle.”

The paper positions this as a way to escape the strict one‑token‑at‑a‑time autoregressive bottleneck without throwing away existing AR models and their tooling. ArXiv paper For practitioners, DEER is more a pattern than a product today, but it is worth watching if you care about decoding speed vs. controllability: you can imagine swapping in your production AR model as the verifier while experimenting with different draft‑time samplers.
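As a pattern, draft‑then‑verify is easy to prototype against an existing AR model. The sketch below keeps DEER’s shape (propose a block, score it with the AR verifier, fall back if the score is poor) but replaces the diffusion drafter with a placeholder sampler and uses GPT‑2 purely so the code runs; the threshold is arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
verifier = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def draft_block(prefix_ids: torch.Tensor, block_len: int) -> torch.Tensor:
    """Stand-in for a diffusion drafter: here we just sample a block from the verifier."""
    out = verifier.generate(prefix_ids, max_new_tokens=block_len,
                            do_sample=True, pad_token_id=tok.eos_token_id)
    return out[:, prefix_ids.shape[1]:]

@torch.no_grad()
def accept(prefix_ids: torch.Tensor, block: torch.Tensor, threshold: float = -4.0) -> bool:
    """Score the drafted block under the AR verifier; reject if it looks too unlikely."""
    ids = torch.cat([prefix_ids, block], dim=1)
    logprobs = torch.log_softmax(verifier(ids).logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:, None]).squeeze(-1)
    return token_lp[:, -block.shape[1]:].mean().item() > threshold

prefix = tok("The quick brown fox", return_tensors="pt").input_ids
block = draft_block(prefix, block_len=8)
print("accept draft:", accept(prefix, block), "->", tok.decode(block[0]))
```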
NBDiff‑7B adapts AR LLMs into block‑diffusion models with 78.8 avg score
The NBDiff work treats classic autoregressive generation as a one‑token special case of block diffusion, then shows you can gradually morph an AR LLM into a block‑diffusion model without starting from scratch. thread summary They keep past tokens causal across blocks, allow bidirectional attention within each block, train all blocks in parallel, and add an auxiliary AR loss so the model doesn’t forget what it already knows.
Instead of training a diffusion LM from zero, they adapt a pre‑trained 7B AR model into NBDiff‑7B‑Instruct, which jumps from a 64.3 to 78.8 average score across their task suite. thread summary That’s a sizable gain for a modest extra training budget, and it suggests a practical path toward block‑level diffusion decoders that still behave like your existing LLM. ArXiv paper
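The attention pattern is the core architectural change, and it is only a few lines to write down. A minimal sketch of a mask that is causal across blocks but bidirectional within each block; block size and sequence length are arbitrary, and this is an illustration rather than NBDiff’s code.

```python
import torch

def block_diffusion_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """True = attention allowed: every token sees all earlier blocks plus its own full block."""
    block_ids = torch.arange(seq_len) // block_size
    return block_ids[:, None] >= block_ids[None, :]

mask = block_diffusion_mask(seq_len=8, block_size=4)
print(mask.int())
# Rows 0-3 (block 0) attend bidirectionally within block 0 only;
# rows 4-7 (block 1) additionally attend causally to all of block 0.
```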
IC‑Effect shows precise video effects editing with pure in‑context learning
IC‑Effect tackles video effects editing—adding or modifying effects at precise times and regions—using only in‑context learning, no task‑specific fine‑tuning. paper link Given a few example clips showing the desired effect behavior, the model learns to apply similar edits to new videos with fine temporal and spatial control.
The interesting part for method‑minded folks is the pattern: complex, structured video transformations are framed as “few‑shot programs” embedded in the prompt, which the model then imitates. ArXiv paper If you’re working on multimodal agents, IC‑Effect is a strong data point that prompt‑as‑program can extend well beyond text and images into time‑based media.
Model‑first reasoning agents cut hallucinations by forcing explicit world models
The Model‑First Reasoning LLM Agents paper argues that many agent failures are representation errors, not planning errors: the model is improvising on a fuzzy understanding of the task. paper abstract Their fix is to force agents to first write down an explicit, symbolic problem model—objects, actions, preconditions, effects, and constraints—before planning any steps.
Across scheduling, routing, resource allocation and logic puzzles, this “world‑model then plan” setup yields more consistent, constraint‑respecting plans and fewer silent hallucinations. paper abstract If you’re already using tools and JSON schemas in agents, this paper is a push to go further: make the agent synthesize and maintain a checkable problem model as a first‑class artifact, not just a hidden chain‑of‑thought.
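A minimal sketch of what “problem model as a first‑class artifact” can look like in practice: have the agent emit structured objects, actions with preconditions and effects, and goals, then validate any proposed plan symbolically before executing it. The schema and the forklift example are illustrative, not the paper’s format.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    preconditions: set[str]
    effects: set[str]

@dataclass
class ProblemModel:
    objects: set[str]
    initial_facts: set[str]
    goal_facts: set[str]
    actions: list[Action] = field(default_factory=list)

def plan_is_valid(model: ProblemModel, plan: list[Action]) -> bool:
    """Replay the plan symbolically; reject it if any precondition is unmet."""
    facts = set(model.initial_facts)
    for step in plan:
        if not step.preconditions <= facts:
            return False
        facts |= step.effects
    return model.goal_facts <= facts

move = Action("move_box_to_dock", {"box_at_shelf", "forklift_free"}, {"box_at_dock"})
world = ProblemModel(objects={"box", "forklift"},
                     initial_facts={"box_at_shelf", "forklift_free"},
                     goal_facts={"box_at_dock"},
                     actions=[move])
print(plan_is_valid(world, [move]))  # True: preconditions hold and the goal is reached
```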
SAGE trains any‑horizon video agents that beat baselines by ~6–8%
SAGE (Smart Any‑horizon aGEnts) is a reinforcement‑learning recipe for long video reasoning: instead of treating every frame equally, the agent learns policies for when to look, what to remember, and when to stop. tweet summary It’s explicitly designed so the same agent can handle short and very long clips without architectural changes.
On curated long‑video benchmarks, SAGE‑7B improves accuracy by about 6–8 percentage points over strong baselines, while using a bounded number of reasoning steps per clip. tweet summary For anyone experimenting with video‑centric agents—surveillance, sports analysis, or lecture QA—this is a concrete template for policy‑driven glimpsing and memory management rather than brute‑force frame processing.
🛡️ Safety, monitoring and content provenance
New monitoring work and platform guardrails. Today centers on CoT monitorability, controlled cyber rollout, and watermark checks in consumer tools.
GPT‑5.2‑Codex boosts cyber capabilities as OpenAI gates access to defenders
OpenAI’s new GPT‑5.2‑Codex significantly improves on previous Codex models for professional capture‑the‑flag (CTF) style security challenges and vulnerability discovery, prompting the company to roll it out cautiously with a trusted‑access program for defensive teams. codex cyber note The launch blog notes that the model beats GPT‑5.1‑Codex‑Max on a proprietary professional CTF eval while still sitting below their internal “High” cyber‑capability threshold under the Preparedness Framework, so it is enabled in Codex for paid ChatGPT users now while API access will follow more slowly. codex launch blog
OpenAI highlights a recent case where a security researcher used GPT‑5.1‑Codex‑Max with the Codex CLI to help find and responsibly disclose a React Server Components vulnerability (now CVE‑2025‑55183) that could expose source code, via an expert‑guided loop of repository scanning, harness construction, fuzzing and human verification. sama ctf chart The GPT‑5.2‑Codex blog doubles down on this pattern by publishing a detailed workflow diagram for using Codex in vuln research, but pairs it with an invite‑only program for vetted defensive security teams so that powerful exploit‑finding behavior is channeled toward patching and audits rather than broad self‑serve use. trusted access note For AI leads, the message is clear: Codex is now potent enough in security domains that you must treat it like dual‑use tooling, with access control, logging, and policy guardrails if you point it at production codebases.
OpenAI publishes chain-of-thought monitorability evals across 24 environments
OpenAI released a dedicated framework and benchmark suite for measuring how well models’ chain-of-thought (CoT) traces can be monitored, covering 13 evaluations across 24 environments and comparing CoT‑only monitoring to watching actions or final answers. cot announcement The study finds that inspecting CoT is substantially more effective for catching problems than monitoring actions/outputs alone, and that longer reasoning traces increase monitorability rather than hiding intent. openai cot blog
For GPT‑5, “all messages” and “CoT‑only” monitors reach g‑mean² around 0.8 at long CoTs, versus ~0.5 when monitoring actions only, and similar trends appear across o3/o3‑mini and other reasoning models. monitoring charts That matters for safety work that wants models to think in the open: it gives quantitative backing to the idea that encouraging explicit reasoning makes oversight easier, not harder. The work also benchmarks multiple model families, showing that newer reasoning‑optimized models like o3‑mini and GPT‑5 have the most monitorable CoTs, which is useful input for teams choosing a “default” model where internal supervision will matter.
Reasoning‑style poisoning paper shows style-only attacks and RSV monitors
A new paper on “Reasoning‑Style Poisoning” (RSP) demonstrates that you can derail LLM agents not by changing facts, but by rewriting retrieved text into harmful styles like “analysis paralysis” or overconfident haste, and that standard prompt‑injection and content filters mostly miss it. rsp summary The authors introduce Generative Style Injection (GSI), which rewrites supporting documents into different reasoning tones, leading an agent to either loop excessively (up to 4.4× more reasoning steps) or jump to wrong answers faster, even though the underlying factual content is untouched. arxiv paper
To defend, they propose Reasoning Style Vectors (RSV) that summarize process patterns—like repeated tool calls, hesitation language, or sudden confidence spikes—and a lightweight runtime monitor (RSP‑M) that flags abnormal RSV trajectories for human or automated intervention. rsp summary The point is: many agent safety systems today focus on what text says (instructions, obvious jailbreaks) or final actions, but this work argues you also need to watch how the model is reasoning over time, because style‑level perturbations can steer long‑horizon agents into failure modes without tripping traditional guards.
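A toy version of the monitoring idea, to show where it sits in an agent stack: summarize process‑level signals from a trace into a small feature vector and flag trajectories that drift outside a calibrated band. The features, phrase lists, and thresholds below are invented for illustration and are not the paper’s RSV definition.

```python
import re

HESITATION = re.compile(r"\b(let me re-check|wait|on second thought|hmm)\b", re.I)
OVERCONFIDENCE = re.compile(r"\b(clearly|obviously|definitely|without a doubt)\b", re.I)

def style_vector(trace: list[dict]) -> dict:
    """Process-level features of a trajectory: how the agent reasoned, not what it concluded."""
    text = " ".join(step.get("content", "") for step in trace)
    return {
        "steps": len(trace),
        "tool_calls": sum(1 for s in trace if s.get("type") == "tool_call"),
        "hesitation": len(HESITATION.findall(text)),
        "overconfidence": len(OVERCONFIDENCE.findall(text)),
    }

def flag(vec: dict, baseline_steps: int = 12) -> list[str]:
    alerts = []
    if vec["steps"] > 3 * baseline_steps:
        alerts.append("possible analysis paralysis (step count far above baseline)")
    if vec["overconfidence"] >= 3 and vec["tool_calls"] == 0:
        alerts.append("confident conclusion with no verification steps")
    return alerts

trace = [{"type": "thought",
          "content": "Clearly the report is genuine, obviously, so there is definitely no need to check the source."}]
vec = style_vector(trace)
print(vec, flag(vec))
```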
Gemini app now checks SynthID watermarks in images and video segments
Google shipped a SynthID-based provenance checker into the Gemini app and web UI, so you can upload an image or video and have Gemini report whether any segment was created or edited with Google AI. gemini synthid post The tool distinguishes between audio and visual tracks and returns time‑stamped spans such as “SynthID detected within the audio between 10–20 secs. No SynthID detected in the visuals,” which is much more granular than a single yes/no flag. segment example

This builds on SynthID watermarks that Google already embeds into its own generative models, now exposing a consumer‑facing way to verify content authenticity for things like political clips, influencer ads, or suspicious voiceovers. google synthid blog For AI teams shipping media tools, it’s an example of how provenance checks can live inside the assistant UX rather than as a separate forensic lab: the assistant becomes both the generator and the authenticity checker. It also quietly sets a bar that other labs using watermarks will be compared against—segment‑level reporting, clear separation of modalities, and consumer‑grade ergonomics instead of raw forensic APIs.
OpenAI updates Model Spec with explicit under‑18 safety principles
OpenAI quietly updated its public Model Spec—the “intended behavior” guide for its models—with a new Under‑18 (U18) section that spells out how systems should respond when they detect a child or teen user. u18 spec update The update is framed as part of the living spec, and Greg Brockman resurfaced it alongside other behavior guidance, reinforcing that this document is meant to inform both internal training and how third‑party builders think about acceptable responses. model spec mention The U18 principles focus on things like avoiding sexualized conversation with minors, steering users toward trusted adults and professional help for self‑harm or abuse topics, and refusing requests that would put a young person at risk, while still being empathetic and helpful. For AI product teams, this is a concrete, referenceable baseline you can align your own guardrails with: if you’re wrapping OpenAI models, your policies and fine‑tuning should not undercut, and ideally should extend, these U18 constraints rather than ignoring them.
Anthropic details Claude’s approach to emotional support and crisis safety
Anthropic published a deep dive on how Claude is shaped to handle emotionally charged conversations—especially around self‑harm and suicide—with explicit system prompts, reinforcement learning, and product‑level safeguards. anthropic safeguards The post describes a suicide and self‑harm classifier that can trigger a crisis banner and route users to human helplines via ThroughLine when language suggests acute risk, plus behavioral policies that push Claude to be compassionate but clear that it is not a human or a therapist. anthropic safety blog For engineers and policy leads, this write‑up is useful because it sketches a full stack: high‑level behavior goals in prompts, reward modeling for “good” crisis responses, and runtime detection that sits outside the model and can evolve independently. It also highlights a design tension that many apps will face: supporting users who are looking for emotional help without crossing into pretending to be a professional, which is increasingly likely to be regulated and audited in healthcare‑adjacent products.
🤖 Affordable humanoids and open desktop robots
Embodied datapath news: low‑cost humanoid aimed at scaling interaction data collection, plus community arrivals of Reachy Mini kits.
LimX’s $6.8k TRON2 humanoid targets large‑scale real‑world data collection
LimX Dynamics’ new TRON2 humanoid is being positioned as a low‑cost workhorse for collecting large volumes of real‑world interaction data, with a reported starting price around $6,800 and specs tuned for everyday manipulation and locomotion tasks. tron2 overview It combines a bipedal base with a full upper body, can walk at up to ~3 m/s and switch to a wheel‑foot configuration for ~5 m/s, and is specced to carry ~30 kg on flat ground and ~20 kg on stairs while keeping each arm’s working payload around 3–5 kg. tron2 overview A separate demo from the company highlights highly reconfigurable, modular morphologies (legs to wheels and back), reinforcing that the platform is meant less as a one‑off demo bot and more as a chassis for diverse data‑gathering setups.

For AI engineers and robotics teams, the point is: this is a relatively cheap way to generate the messy “try something, feel what happens, adjust” streams that vision‑language‑action models need but synthetic data can’t fully provide. tron2 overview Because TRON2 is being pitched with ROS support, onboard compute options and an EDU variant, it slots into existing research and startup stacks rather than demanding a proprietary toolchain, and it hints at a near‑future where hundreds or thousands of similar humanoids are deployed primarily as data capture appliances instead of factory showpieces. tron2 overview
Reachy Mini kits hit builders’ desks via Hugging Face tie‑in
Pollen Robotics’ Reachy Mini desktop robot is now showing up on developer desks worldwide, with a noticeable wave of unboxings and early experiments tied to Hugging Face’s distribution partnership. reachy-mini box The kit arrives in a co‑branded box (Pollen + Hugging Face) as a compact, desk‑friendly unit, and it is already being driven by a polished desktop control app that lets users script motions and interactions without diving straight into low‑level robotics code.

Anthropic, HF and the broader community are leaning into it as an on‑ramp for embodied AI tinkering: people are assembling Reachy Mini with their kids, family build post bringing units to holiday parties, and asking for custom apps like Christmas‑carol performances and classroom demos. carols request For AI engineers, the appeal is a physically safe, relatively affordable robot you can keep next to your monitor while you prototype vision‑language‑action loops, MCP‑style control stacks, or voice‑driven behaviors—using the same HF ecosystem you already use for models and data. reachy-mini box It’s less about industrial payloads and more about building intuition and tooling for everyday robot behaviors in a form factor that fits in a home office or lab bench.