GPT‑5.2‑Codex hits 56.4% SWE‑Bench Pro – gated cyber access

Thu, Dec 18, 2025

Executive Summary

OpenAI’s week is about agents, not chat: GPT‑5.2‑Codex is now the default coding brain in Codex and paid ChatGPT, tuned for long‑horizon refactors, tighter Windows/tool use, and context compaction so multi‑hour sessions don’t blow your token budget. On OpenAI’s own numbers it edges GPT‑5.2 and 5.1‑Codex‑Max on real‑world suites, hitting 56.4% on SWE‑Bench Pro and 64.0% on Terminal‑Bench 2.0, which map much closer to “did this actually fix the repo?” than to toy LeetCode.

The twist is cyber. 5.2‑Codex sits at the top of OpenAI’s internal professional CTF evals, with pass@12 clustered around 85–90%, so the company is turning it on inside Codex but slow‑rolling API access and spinning up an invite‑only “trusted access” track for defensive teams. That follows a recent CVE‑2025‑55183 React exploit that a researcher co‑developed with 5.1‑Codex‑Max, and OpenAI evidently doesn’t want that workflow in every random pentest bot.
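For readers unfamiliar with the metric, pass@12 is usually read as "at least one of 12 attempts solved the challenge." The source doesn't say how OpenAI computes it for these CTF evals; the sketch below assumes the standard unbiased pass@k estimator from the original Codex evaluation paper, and the function name and sample numbers are purely illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single task.

    n: total attempts sampled, c: attempts that succeeded, k: attempt budget.
    Returns the probability that at least one of k attempts, drawn without
    replacement from the n recorded attempts, is a success.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 24 attempts on one challenge, 2 of them solved it.
    print(round(pass_at_k(n=24, c=2, k=12), 4))  # about 0.7609
```

The takeaway is that a high pass@12 can coexist with a low per‑attempt success rate: in the hypothetical case above, only 2 of 24 attempts succeed, yet the 12‑try estimate is roughly 76%.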

Codex CLI 0.74 makes 5.2‑Codex the default with per‑run reasoning effort knobs and a configurable sandbox; devs report “medium” effort covering ~85% of work and x‑high rescuing gnarly bugs Opus 4.5 couldn’t crack in an hour. But 5.2‑Codex underperforms 5.1‑Codex‑Max on MLE‑Bench‑30, and repo hints of a “caribou” flagship suggest a 5.2‑Codex‑Max tier is already brewing. Treat this as your new baseline model—then keep workload‑specific evals and routing in the loop.

Feature Spotlight

Feature: GPT‑5.2‑Codex lands for agentic coding and cyber

OpenAI ships GPT‑5.2‑Codex in Codex: SOTA agentic coding with 56.4% SWE‑Bench Pro and 64.0% Terminal‑Bench 2.0, context compaction for long runs, stronger Windows/vision, and an invite‑only cyber program.

Cross‑account launch with docs, CLIs and early dev reports. Focus is long‑horizon coding (context compaction), better Windows/tool use, vision‑aided code reading, and a new trusted‑access path for defensive cybersecurity.

Table of Contents

🧠 Feature: GPT‑5.2‑Codex lands for agentic coding and cyber

OpenAI turns on GPT‑5.2‑Codex for long‑horizon agentic coding

GPT‑5.2‑Codex posts SOTA on SWE‑Bench Pro and Terminal‑Bench 2.0

GPT‑5.2‑Codex pushes cyber skills, OpenAI adds trusted‑access track

Codex CLI 0.74 makes GPT‑5.2‑Codex the default and exposes sandbox tuning

Builders report GPT‑5.2‑Codex fixing bugs faster with a terser personality

Caribou presets hint at a GPT‑5.2‑Codex‑Max follow‑on


🛠️ Coding agents and dev tooling in practice

RepoPrompt 1.5.57 adds full CLI and MCP slash commands for repo-scale agents

Amp editor adds review agent as teams call code review the new bottleneck

Toad becomes a universal ACP terminal for Claude, Codex, Gemini and more

LangChain’s Deepagent-CLI leans on reflection to update agent memory over time

Notte’s Demonstrate Mode turns recorded browser sessions into automation code

Warp adds Python and Node environment chips to cut debugging friction

WarpGrep positions itself as a specialized MCP search agent for Claude Code

Claude Code experiments with shareable sessions for showcasing agent runs

Cline partners with LG CNS to build an “AI-native” enterprise dev flow

Yutori Scouts adds suggested query tweaks and privacy controls for web agents


🧩 Interoperability: Agent Skills and AG‑UI momentum

Anthropic publishes Agent Skills open standard and enterprise Skills Directory

Open-source frameworks and agents quickly adopt the Agent Skills standard

Oracle’s Open Agent Spec adopts AG‑UI protocol via CopilotKit


📊 Leaderboards and eval suites: cost, tools, and rankings

Arena Search leaderboard adds GPT‑5.2‑Search and Grok‑4.1‑Fast‑Search

Gemini 3 Flash consolidates its lead across multiple third-party eval suites

GPT‑5.2 and Olmo‑3.1‑32B‑Think join Arena’s text model rankings

GSO-Bench: GPT‑5.2‑high + OpenHands leads optimization speedup rankings

K2‑V2 tops Artificial Analysis openness index with competitive intelligence

Scale AI releases Audio MultiChallenge for multi-turn spoken-dialogue evals

Vals and AA paint a mixed but clearer eval picture for Gemini 3 Flash

Arena open-sources Arena-Rank, the paired-comparison ranking engine behind LMArena

Exa publishes People Search benchmark for role- and location-aware profile retrieval

MLE-Bench-30 shows GPT‑5.2-Codex underperforming 5.1-Codex-Max on ML engineering tasks


⚙️ Serving and runtime: MoE scaling, hybrid APIs, embedded media

vLLM wide-EP MoE hits ~2.2k tok/s per H200 GPU on multi-node runs

SGLang adds Ollama-compatible API to bridge local and cloud inference

FunctionGemma lands in Ollama for tiny on-device function-calling agents

LiveKit and Espressif ship ESP32 SDK for realtime AV on microcontrollers

Warp adds Python and Node environment chips for quicker debug context


🏗️ AI infrastructure build‑out and power economics

DOE’s Genesis Mission signs AI lab MOUs to double scientific output

Epoch: GPUs only ~40% of AI datacenter power once overheads included

64 GB DDR5 kit spikes from ~$150 to ~$800 in three months


📥 Data plumbing: OCR, parsing tiers and robust JSON

Mistral OCR 3 pushes SoTA document understanding at $1–2 per 1k pages

Firecrawl launches /agent for autonomous web navigation and dataset extraction

OpenRouter ships Response Healing to auto‑repair malformed JSON from LLMs

NotebookLM adds Data Table artefact and one‑click export to Sheets


💼 Capital flows and enterprise platform moves

OpenAI explores tens of billions in new funding at ~$750B valuation

Amazon in talks to invest ≥$10B in OpenAI and supply Trainium chips

Lovable raises $330M Series B at $6.6B to scale no‑code AI app platform

Amazon restructures AI into unified AGI division under Peter DeSantis

OpenAI sells 700k+ ChatGPT licenses across ~35 US public universities

Morningstar and PitchBook ship financial data apps into ChatGPT

Perplexity launches desktop‑grade iPad app for AI research and browsing

Sora video app rolls out across 10 Latin American countries


🎙️ Realtime voice stacks: accuracy, controls and hosting

Grok Voice Agent tops speech reasoning benchmark and posts strong latency profile

Scale AI releases Audio MultiChallenge benchmark for multi‑turn voice reasoning

Together AI hosts Rime Arcana v2 and Mist v2 for production TTS

ElevenLabs adds Versioning control plane for voice agents

LiveKit and Espressif ship ESP32 SDK for tiny realtime AI voice endpoints


🎬 Creative media pipelines and layered edits

Qwen-Image-Layered auto-splits images into editable RGBA layers

Bria Video Eraser lands on fal for object and person removal

GPT-Image-1.5 plus Kling 2.5 emerges as a character-consistent video stack

Higgsfield Cinema Studio brings camera, lens and move presets to browser video

LMArena spins up community A/Bs of GPT-Image-1.5 vs Nano Banana Pro

Luma launches Ray 3 Modify for character swaps in existing videos

ComfyUI adds Template Library of real-world workflows and open-source graphs

Gamma bakes Nano Banana Pro into its deck builder with sharp in-slide text

Runway Gen-4.5 video model appears inside Adobe Firefly Boards

Seedance Pro 1.5 claims new SOTA lip-sync for character video


📑 Methods and agents: memory, long‑context and hybrid decoding

MEM1 trains constant‑memory long‑horizon agents with 3.5× better performance

Meta’s qTTT test‑time training boosts long‑context Qwen3‑4B by 12–14 points

CANOE uses synthetic QA + RL to cut RAG hallucinations and beat GPT‑4o

CANOE’s Dual‑GRPO RL unifies memory and reasoning for faithful long‑context use

DEER proposes draft‑with‑diffusion, verify‑with‑AR decoding for language models

NBDiff‑7B adapts AR LLMs into block‑diffusion models with 78.8 avg score

IC‑Effect shows precise video effects editing with pure in‑context learning

Model‑first reasoning agents cut hallucinations by forcing explicit world models

SAGE trains any‑horizon video agents that beat baselines by ~6–8%


🛡️ Safety, monitoring and content provenance

GPT‑5.2‑Codex boosts cyber capabilities as OpenAI gates access to defenders

OpenAI publishes chain-of-thought monitorability evals across 24 environments

Reasoning‑style poisoning paper shows style-only attacks and RSV monitors

Gemini app now checks SynthID watermarks in images and video segments

OpenAI updates Model Spec with explicit under‑18 safety principles

Anthropic details Claude’s approach to emotional support and crisis safety


🤖 Affordable humanoids and open desktop robots

LimX’s $6.8k TRON2 humanoid targets large‑scale real‑world data collection

Reachy Mini kits hit builders’ desks via Hugging Face tie‑in
