OpenAI GPT‑5‑Codex – 74.5% SWE‑bench; 7‑hour autonomy across IDEs
Executive Summary
OpenAI’s GPT‑5‑Codex lands with dynamic thinking and real staying power. It runs independently for 7+ hours on complex refactors and tests while pushing SWE‑bench Verified to 74.5%. Refactor accuracy climbs to 51.3%, and early users report snappier small edits with denser, higher‑value review comments.
In numbers:
- Benchmarks: 74.5% SWE‑bench Verified vs 72.8% (GPT‑5); refactors 51.3% vs 33.9%
- Dynamic effort: 93.7% fewer tokens on easy turns; 102.2% more on hardest 10%
- Autonomy: 7+ hours on complex runs; faster replies on straightforward edits
- Adoption: ~40% of Codex traffic today; trending to majority, per OpenAI
- Codex Cloud: cached containers cut median completion time ~90% on remote tasks
- CLI v0.36: resume, images, to‑dos, MCP; streaming errors patched quickly
Also:
- Xcode 26 adds GPT‑5 and Claude Sonnet 4 assistants inside IDE workflows
- OpenAI API throughput: GPT‑5 to 4M TPM; GPT‑5‑mini starts at 500k TPM
- LiveMCPBench covers 70 MCP servers and 527 tools; reliability benchmarking expands
- AGENTS.md adopted across Codex, Amp, Cursor, Jules, RooCode; one spec over 5+ stacks
Feature Spotlight
Agentic Coding: GPT-5‑Codex Launch
OpenAI ships GPT‑5‑Codex: a coding‑optimized GPT‑5 that adapts thinking time, works autonomously for hours, and lifts refactor accuracy (51.3%). Now in CLI/IDE/GitHub—this resets the coding‑agent bar.
Major OpenAI drop: GPT‑5‑Codex across CLI/IDE/Web/GitHub with dynamic thinking, multi‑hour autonomy, and large refactor gains. Broad, cross‑account coverage today.
- OpenAI blog: GPT‑5‑Codex release with dynamic reasoning (−93.7% tokens on easy, +102.2% on hard), 7+ hr runs, SWE‑bench 74.5% vs GPT‑5 72.8%
- Codex 0.36.0 CLI: new codex resume, images, to‑dos, MCP, web search; IDE model picker and cloud handoff UX refresh
- GitHub code review bot metrics: 4.4% incorrect comments vs 13.7% (GPT‑5), 52.4% high‑impact comments, 0.93 comments/PR
- System prompt leak: AGENTS.md precedence, commit rules, citations, screenshot tool, PR requirements
- Xcode 26 integration: GPT‑5 built‑in; Anthropic announces Claude Sonnet 4 in Xcode; developer competition narrative
- OpenAI podcast + AMA: gdb and Codex lead on leap from autocomplete to agents; employee traffic shift to GPT‑5‑Codex
🦿 Robotics and Embodied AI
Embodied stacks and progress: NavFoM foundation model across robots, Unitree precision demos, Figure teasers, OpenAI robotics hiring, safety ‘violence tests’.
OpenAI ramps robotics hiring with teleoperation and Isaac Sim focus
OpenAI is staffing up for robotics with roles spanning teleoperation (teaching via remote control) and simulation built on NVIDIA Isaac Sim, plus mechanical roles that hint at large‑scale hardware ambitions (1M+ unit language in postings), per a detailed roundup Wired report. The push follows OpenAI’s prior sim‑to‑real work (e.g., Dactyl) and lands amid a broader industry arms race in embodied AI; researchers like Sergey Levine argue deployed experience will kick off a robotics flywheel within single‑digit years robotics flywheel, positioning vs 2009.
NavFoM unifies navigation across drones, quadrupeds and vehicles with 8M‑sample training
Peking University’s NavFoM is an embodied navigation foundation model trained on 8M samples spanning quadrupeds, drones, wheeled robots and vehicles; it handles VLN, ObjNav, tracking and driving in one model, outperforming baselines and showing real‑world deployment demos project page, arXiv abstract, project site, ArXiv page. The vertically integrated setup (perception→policy) targets cross‑platform reuse: a single policy stack for teams deploying across robot embodiments.
SUSTech “violence test” shows fall‑robust humanoid control under hard perturbations
Southern University of Science and Technology’s ACT Lab published a humanoid “violence test” that stress‑tests balance and recovery—demonstrating dynamic control that resists pushes and recovers from destabilizing hits lab video, follow‑up clip. Complementary clips highlight learned dynamic control in real‑time dynamic control. For robotics teams, it’s a concrete reference for robustifying locomotion policies beyond nominal gaits (test‑time perturbation tolerance).
Figure teases three days of announcements amid F.03 and Helix AI buzz
Figure CEO Brett Adcock says the company will unveil “three big announcements” over three days, fueling community speculation around F.03 hardware and Helix AI updates Adcock tease. Builders following humanoid stacks flagged the cadence as a signal of platform maturation (hardware, control stack, and agent loop) rumor roundup.
Unitree’s precision demos trend as company prepares to go public
A circulating clip underscores Unitree’s precision/stability capabilities—timely as the company is reported to be going public soon stability demo, context link. For embodied teams, Unitree’s cost/availability trajectory continues to make it a popular testbed for locomotion and manipulation research.
Tencent teases new Hunyuan 3D model after ultra‑long world model reveal
Tencent hinted at an upcoming Hunyuan 3D model teaser. This lands in context of World‑Voyager (an ultra‑long‑range world model with native 3D reconstruction). Teams building sim‑to‑real pipelines should watch for improved 3D perception/generation that can feed policy learning and scene understanding.
🛡️ Security, Safety and Governance
Safety and governance threads: California SB‑53 AI safety bill, Anthropic DC push vs China adoption, news publisher lawsuits, chip trade probes, and prompt‑injection risks for net‑enabled agents.
California SB‑53 clears Legislature: AI safety disclosures, whistleblower protections and CalCompute now await Newsom
California’s SB‑53 passed both chambers and heads to the governor. The bill compels larger AI developers to disclose safety protocols, protects whistleblowers, and launches a state‑run compute cloud (CalCompute). Reporting scales by revenue (full transparency for $500M+ firms; lighter duties for smaller labs). It now awaits Gov. Newsom’s signature or veto amid concerns about conflicts with federal/EU rules bill recap.
Anthropic pitches DC: $1 Claude for Government/Enterprise for 1 year, FedRAMP High, and national security pilots
Anthropic urged faster US government AI rollout, warning China is moving quicker on public‑sector AI. To speed adoption, it’s offering Claude for Enterprise and Claude for Government at $1 for one year via a GSA OneGov deal, citing hundreds of thousands of federal users already on Claude. It highlighted FedRAMP High posture, availability to national security users, work at Lawrence Livermore, and Pentagon CDAO pilots; other vendors are matching discounts to enable multi‑stack trials FedScoop writeup.
China opens dual probes into US chip policies and analog chip dumping ahead of new trade talks
Beijing launched two investigations: (1) alleged discrimination in US chip policy (export controls, tariffs, CHIPS Act) that blocks Chinese firms from advanced tech and markets; (2) alleged dumping of US analog chips in China at unfair prices. The timing coincides with new US‑China trade talks in Spain, escalating tech‑trade tensions that directly affect AI hardware supply chains chip probes image.
OpenAI flags usage policy update on September 17, 2025
OpenAI’s policies page shows an upcoming update dated Sep 17, 2025, directing readers to a new version. Organizations with compliance or governance controls should anticipate reviewing enforcement and scope changes once live and adjust internal guardrails accordingly policies page.
OpenAI warns Codex internet access can be prompt‑injected; recommends strict allowlists and limited HTTP methods
OpenAI’s Codex docs caution that enabling agent internet access introduces prompt‑injection, data exfiltration, malware, and licensing risks. Guidance: keep internet access off by default; when required, use narrow domain allowlists, prefer safe methods (GET/HEAD/OPTIONS), and consider preset allowlists for common dependencies to reduce blast radius internet access doc Codex internet access. This lands in context of MCP firewall (open‑source firewall to curb agent data exfiltration), extending defensive posture from tooling to configuration hardening.
🎨 Gen Media, Vision and Creative Stacks
Creative image/video pipelines and UGC tactics. Heavy Nano Banana/Gemini 2.5 virality, Seedream 4 demos, Comfy Cloud beta. Excludes the HunyuanImage 2.1 release, which is in New Models & Open Weights.
Higgsfield makes Soul image model free for everyone
Higgsfield opened its photorealistic Soul image model to all users at no cost, with creators showcasing CCTV‑style presets, 90s film vibes, and lifestyle shots in a hands‑on thread Launch thread. Access is confirmed by multiple posts and the official site Model now free Soul is free Higgsfield. The examples include step‑by‑step prompts (shared via ALT text) and workflows mixing Soul with video tools for animations Prompt pack. For teams evaluating image stacks, this materially lowers trial friction while raising expectations on realism-to-effort ratios.
Google pushes Nano Banana; Gemini tops iOS downloads and a Nano Banana ad passes 20M views
Gemini climbed to the most‑downloaded iOS app in the US iOS chart, with leadership signaling more to ship Demis congrats. A Nano Banana creative ad surpassed 20M views, highlighting image gen as a mass‑market onboarding hook Ad views. This comes in context of workflow that showed a Nano Banana storytelling pipeline; today’s push suggests image features are converting broader users faster than audio/video Google moving fast.
MoonValley debuts Marey video model; ~$18/min and licensed‑only training
MoonValley launched Marey for text‑to‑video and image‑to‑video, trained exclusively on licensed HD footage (no scraped or user uploads) Model overview. On the Artificial Analysis arena it ranks #12 for text‑to‑video and #21 for image‑to‑video, near Seedance 1.0 Mini and Vidu Q1 peers Prompt set Video arena. Pricing lands around $18/min both on MoonValley’s site and fal.ai per the summary Model overview. The thread also contrasts cost/quality against incumbents (e.g., Veo 3 No Audio at ~$30/min vs LTXV v0.9.7 at ~$1.20/min) to frame trade‑offs Model overview.
Seedream 4 + Kling 2.1: start/end‑frame video workflow for creators
A practical thread walks through a fast way to make viral clips: capture a real starting frame, generate an ending frame with Seedream 4 (prompt in ALT), then feed both into Kling 2.1 to animate; stitch outputs in any editor How‑to thread Kling step Editing tip. The post emphasizes minimal setup and provides prompts, positioning this as a repeatable template for short‑form social content Follow‑ups.
🗣️ Voice AI, STT/TTS and Productions
Speech tech in production: ElevenLabs STT deployment and new managed service combining AI + human editing. Mostly enterprise case studies and pricing.
ElevenLabs launches Productions, a managed dubbing/captions/transcripts/audiobooks service at $2/min
ElevenLabs rolled out Productions, an end‑to‑end, human‑in‑the‑loop service for creators and media teams covering dubbing, captions/subtitles (with SDH), transcripts and audiobooks, priced from $2.00/min with priority support and studio‑grade post‑processing Service launch. It’s already powering dubs for Dude Perfect and Andrew Huberman and is used on Hollywood film work Dubbing details. Ordering happens inside ElevenLabs accounts, and they’re recruiting a Producer network for linguists/localization pros Producer signup. Full offering and pricing are outlined in the product page ElevenLabs Productions. In context of hackathon voices where ElevenLabs amplified agent devs, this formalizes a managed pipeline for broadcast‑quality output with human QA. Capabilities by line of business:
- Dubbing Dubbing details
- Captions/subtitles with accessibility tags Captions info
- 99%‑accuracy transcripts backed by Scribe STT Transcripts info
- Audiobook polishing Audiobooks workflow
CARS24 transcribes 20k hours/month in 14 languages; +35% conversions, −40% disputes on ElevenLabs STT
CARS24 runs 20,000 hours/month of multilingual call audio (Hindi, English and code‑mixed dialects across 14 languages) through ElevenLabs Speech‑to‑Text on Google Cloud to flag issues in real time and generate ops insights Customer story. Reported impact: +35% first‑visit conversions, −40% evaluation disputes, +25% CSAT, and 50% faster resolutions, with accuracy claims up to 98.5% in their ambient capture setup ElevenLabs case study. The workflow spans encrypted pipelines, PII redaction, live quality monitoring and alerts, feeding hub performance dashboards Results summary.
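The PII‑redaction stage in a pipeline like this amounts to pattern substitution over transcripts before storage. A minimal sketch of the technique, assuming simple regex patterns; the patterns and the `redact` helper are illustrative, not CARS24’s or ElevenLabs’ actual pipeline.

```python
import re

# Hypothetical redaction pass: replace emails and phone-like digit runs
# with placeholder tags before transcripts are persisted or dashboarded.
PII_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"(?:\+?\d[\s-]?){10,13}"), "[PHONE]"),
]

def redact(transcript: str) -> str:
    """Substitute each PII pattern with its placeholder label."""
    for pattern, label in PII_PATTERNS:
        transcript = pattern.sub(label, transcript)
    return transcript
```

Production systems typically pair rules like these with NER models for names and addresses, but the substitution shape stays the same.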
ElevenLabs ships v0 Agents Starter to add voice agents to apps in minutes
ElevenLabs Devs released a v0 Agents Starter that lets teams configure an agent in ElevenLabs, clone a v0 starter template and converse with their app immediately—useful for quick PoCs and onboarding conversational features Starter intro. All starters are listed in one place with setup guidance Starters index, with an external quickstart link for the ElevenLabs v0 template Starter guide. This complements Productions and STT pipelines by lowering the lift to prototype voice‑first app flows.
🔌 Orchestration & MCP Interop
MCP connectors and standards, agent formats, and tool routing. Mostly MCP plugins, AGENTS.md standardization, and Live MCP tool‑calling evals. Excludes GPT‑5‑Codex feature items.
Codex CLI v0.36 adds MCP integration and safer internet access with approval modes
OpenAI’s Codex CLI update integrates MCP (with three approval modes), adds images and to‑dos, and tightens internet access controls. The official notes emphasize faster cloud envs and tool use at scale, with MCP as the interop backbone Release highlights. The new internet access docs warn plainly about prompt injection and data exfiltration, recommending strict allowlists and limited HTTP verbs for safety in agent runs Internet access docs. Early users hit stream errors that were patched quickly, signaling active hardening of the stack Stream error, Fix shipped. Together these changes make Codex a cleaner citizen of the MCP/tool ecosystem while reducing security foot‑guns.
LiveMCPBench launches: 70 MCP servers and 527 tools benchmarked for real-world coverage
A new wave of MCP tool‑calling benchmarks landed this week—headlined by LiveMCPBench covering 70 MCP servers and 527 tools—plus complementary suites for stability and failure modes MCP benchmarks. The emphasis shifts from prompt toys to end‑to‑end tool reliability, a key metric for agents spanning heterogeneous stacks. This arrives in context of MCP pair programming, where we saw practical MCP adoption for dev workflows, and helps teams choose robust servers and surface breakages early.
AGENTS.md gains traction as a cross‑agent contract; Codex prompt enforces it
The AGENTS.md pattern—machine‑readable repo instructions for coding agents—is consolidating. Community voices push to deprecate CLAUDE.md in favor of AGENTS.md across agents like Codex, Amp, Jules, Cursor, RooCode, and more Agent doc standard. OpenAI’s Codex system prompt explicitly instructs agents to discover and obey AGENTS.md, scoping precedence rules and programmatic checks for PRs Codex prompt leak. Changelog/UI updates reinforce multi‑environment behavior that benefits from a consistent agent contract Codex changelog. Result: fewer bespoke conventions and smoother tool interop across ecosystems.
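The pattern is just a markdown file of repo‑level instructions that agents discover and obey. A minimal hypothetical example of what such a file might contain (the contents below are an illustrative assumption, not from any specific repo or spec):

```markdown
# AGENTS.md

## Setup
- Install dependencies with `npm ci` before running anything.

## Checks
- Run `npm test` and `npm run lint`; both must pass before committing.

## Conventions
- Use conventional-commit messages (e.g. `fix: ...`).
- Never edit files under `vendor/`.
```

Because the file is plain markdown, the same contract works unchanged across Codex, Amp, Cursor, Jules, and RooCode.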
GenKit adds MCP plugin for Go/JS to connect external servers or expose your own tools
Firebase’s GenKit now ships an MCP plugin for Go and JavaScript: connect to external MCP servers, manage multiple connections, and even expose your GenKit tools as an MCP server. The docs call out using test servers (time/everything) or custom ones, turning GenKit flows into first‑class tools inside agent stacks GenKit MCP plugin. This tightens interop between app logic and shared tool buses, reducing bespoke integrations across agent frameworks.
🧬 New Models & Open Weights
Model drops beyond coding: UI‑action VLMs, image/video models, open‑sourced reasoning and DP training. Mostly Holo1.5, HunyuanImage 2.1, VaultGemma, ERNIE 4.5 A3B; MobileLLM‑R1 tease. Excludes GPT‑5‑Codex.
H Company ships Holo1.5 CU models (3B/7B/72B), 7B under Apache 2.0
H Company released Holo1.5, a family of computer‑use (CU) VLMs with +10% accuracy over Holo1 and SOTA UI localization/VQA; sizes are 3B (research‑only), 7B (Apache 2.0), and 72B (research‑only) Launch thread. Benchmarks claim it beats open baselines (Qwen‑2.5 VL), outperforms closed Sonnet 4 and tops specialized systems like UI‑TARS/Venus Benchmarks note. Models and docs are live on Hugging Face and the blog Hugging Face page H Company blog. Demos emphasize precise element localization (click X,Y) and UI VQA for reliable web/desktop/mobile navigation UI demo.
Google unveils VaultGemma 1B trained with differential privacy (ε≈2)
VaultGemma 1B is a DP‑trained LLM using DP‑SGD with Poisson sampling and new scaling laws tying compute/data/privacy to utility. Google reports a 50‑token prefix test showing no detectable memorization, with performance near GPT‑2 1.5B while preserving privacy (sequence‑level ε≈2, δ≈1.1e‑10 for 1024‑token units) Google blog post. The work details how larger batches/smaller models and tuning iterations stabilize training under DP noise Overview thread. Full blog explains the noise‑batch ratio as a key control knob for setting quality vs. privacy Scaling insights.
HunyuanImage 2.1 leads open‑weights image arena; tight license, $100/1k images
Tencent’s HunyuanImage 2.1 (17B DiT, native 2048×2048, bilingual) is now the top open‑weights text‑to‑image model on the Artificial Analysis arena, edging HiDream‑I1‑Dev and Qwen‑Image Arena thread. Availability and pricing: on fal at ~$100 per 1k images; a Hugging Face demo is live Arena link. The Tencent Community License restricts >100M MAU products, bars EU/UK/South Korea use, and forbids using outputs to improve non‑Hunyuan models Arena thread. See side‑by‑side prompts and results in the arena listing Image arena.
Baidu open‑sources ERNIE‑4.5‑21B‑A3B‑Thinking; #1 trending text‑gen on HF
Baidu released ERNIE‑4.5‑21B‑A3B‑Thinking, a mixture‑of‑experts model focused on strong reasoning. As of Sept 11, it ranked #1 among trending text‑generation models on Hugging Face Trending note, with community posts highlighting the open‑sourcing as “built for serious reasoning” Announcement recap.
Tencent SRPO alignment tops HF trending; boosts FLUX realism 3×
SRPO (Semantic Relative Preference Optimization) + Direct‑Align refines diffusion outputs without over‑optimizing late steps, with text‑conditioned rewards that reduce reliance on offline reward tuning. Applied to FLUX.1.dev, human‑rated realism/aesthetics improved more than 3× in reported tests Model card link. It’s now #1 trending on Hugging Face Trending snapshot. Source assets include the model page and a live Space demo for quick trials HF model HF Space.
MobileLLM‑R1 (<1B) lands on Hugging Face for edge reasoning
The sub‑1B‑parameter, edge‑focused reasoning model MobileLLM‑R1 is now on Hugging Face HF drop. This follows the in‑browser rollout noted in browser demo, where it ran 100% locally via Transformers.js. Roundup threads also list it among recent notable releases Release roundup.