OpenAI GPT‑5‑Codex – 74.5% SWE‑bench; 7‑hour autonomy across IDEs

Executive Summary

OpenAI’s GPT‑5‑Codex lands with dynamic thinking and real staying power. It runs independently for 7+ hours on complex refactors and tests while pushing SWE‑bench Verified to 74.5%. Refactor accuracy climbs to 51.3%, and early users report snappier small edits with denser, higher‑value review comments.

In numbers:

  • Benchmarks: 74.5% SWE‑bench Verified vs 72.8% (GPT‑5); refactors 51.3% vs 33.9%
  • Dynamic effort: 93.7% fewer tokens on easy turns; 102.2% more on hardest 10%
  • Autonomy: 7+ hours on complex runs; faster replies on straightforward edits
  • Adoption: ~40% of Codex traffic today; trending to majority, per OpenAI
  • Codex Cloud: cached containers cut median completion time ~90% on remote tasks
  • CLI v0.36: resume, images, to‑dos, MCP; streaming errors patched quickly

Also:

  • Xcode 26 adds GPT‑5 and Claude Sonnet 4 assistants inside IDE workflows
  • OpenAI API throughput: GPT‑5 to 4M TPM; GPT‑5‑mini starts at 500k TPM
  • LiveMCPBench covers 70 MCP servers and 527 tools; reliability benchmarking expands
  • AGENTS.md adopted across Codex, Amp, Cursor, Jules, RooCode; one spec over 5+ stacks

Feature Spotlight

Agentic Coding: GPT‑5‑Codex Launch

OpenAI ships GPT‑5‑Codex: a coding‑optimized GPT‑5 that adapts thinking time, works autonomously for hours, and lifts refactor accuracy (51.3%). Now in CLI/IDE/GitHub—this resets the coding‑agent bar.

Major OpenAI drop: GPT‑5‑Codex across CLI/IDE/Web/GitHub with dynamic thinking, multi‑hour autonomy, and large refactor gains. Broad, cross‑account coverage today.

🦿 Robotics and Embodied AI

Embodied stacks and progress: NavFoM foundation model across robots, Unitree precision demos, Figure teasers, OpenAI robotics hiring, safety ‘violence tests’.

OpenAI ramps robotics hiring with teleoperation and Isaac Sim focus

OpenAI is staffing up for robotics with roles spanning teleoperation (teaching via remote control) and simulation built on NVIDIA Isaac Sim, plus mechanical roles that hint at large‑scale hardware ambitions (1M+ unit language in postings), per a detailed roundup Wired report. The push follows OpenAI’s prior sim‑to‑real work (e.g., Dactyl) and lands amid a broader industry arms race in embodied AI; researchers like Sergey Levine argue deployed experience will kick off a robotics flywheel within single‑digit years robotics flywheel, positioning vs 2009.

NavFoM unifies navigation across drones, quadrupeds and vehicles with 8M‑sample training

Peking University’s NavFoM is an embodied navigation foundation model trained on 8M samples spanning quadrupeds, drones, wheeled robots and vehicles; it handles VLN, ObjNav, tracking and driving in one model, outperforming baselines and showing real‑world deployment demos project page, arXiv abstract, project site, ArXiv page. The vertically integrated setup (perception→policy) targets cross‑platform reuse—useful for teams seeking a single policy stack across platforms.

SUSTech “violence test” shows fall‑robust humanoid control under hard perturbations

Southern University of Science and Technology’s ACT Lab published a humanoid “violence test” that stress‑tests balance and recovery—demonstrating dynamic control that resists pushes and recovers from destabilizing hits lab video, follow‑up clip. Complementary clips highlight learned dynamic control in real‑time dynamic control. For robotics teams, it’s a concrete reference for robustifying locomotion policies beyond nominal gaits (test‑time perturbation tolerance).

Figure teases three days of announcements amid F.03 and Helix AI buzz

Figure CEO Brett Adcock says the company will unveil “three big announcements” over three days, fueling community speculation around F.03 hardware and Helix AI updates Adcock tease. Builders following humanoid stacks flagged the cadence as a signal of platform maturation (hardware, control stack, and agent loop) rumor roundup.

Tencent teases new Hunyuan 3D model after ultra‑long world model reveal

Tencent hinted at an upcoming Hunyuan 3D model teaser in context of World‑Voyager showcasing an ultra‑long‑range world model with native 3D reconstruction. Teams building sim‑to‑real pipelines should watch for improved 3D perception/generation that can feed policy learning and scene understanding.

Unitree’s precision demos trend as company prepares to go public

A circulating clip underscores Unitree’s precision/stability capabilities—timely as the company is reported to be going public soon stability demo, context link. For embodied teams, Unitree’s cost/availability trajectory continues to make it a popular testbed for locomotion and manipulation research.


🎨 Gen Media, Vision and Creative Stacks

Creative image/video pipelines and UGC tactics: Higgsfield’s Soul goes free, heavy Nano Banana/Gemini 2.5 virality, MoonValley’s Marey debut, and a Seedream 4 + Kling 2.1 workflow. HunyuanImage 2.1 is covered under New Models & Open Weights.

Higgsfield makes Soul image model free for everyone

Higgsfield opened its photorealistic Soul image model to all users at no cost, with creators showcasing CCTV‑style presets, 90s film vibes, and lifestyle shots in a hands‑on thread Launch thread. Access is confirmed by multiple posts and the official site Model now free Soul is free Higgsfield. The examples include step‑by‑step prompts (shared via ALT text) and workflows mixing Soul with video tools for animations Prompt pack. For teams evaluating image stacks, this materially lowers trial friction while raising expectations on realism-to-effort ratios.

Google pushes Nano Banana; Gemini tops iOS downloads and a Nano Banana ad passes 20M views

Gemini climbed to the most‑downloaded iOS app in the US iOS chart, with leadership signaling more to ship Demis congrats. A Nano Banana creative ad surpassed 20M views, highlighting image gen as a mass‑market onboarding hook Ad views. This comes in context of workflow that showed a Nano Banana storytelling pipeline; today’s push suggests image features are converting broader users faster than audio/video Google moving fast.

MoonValley debuts Marey video model; ~$18/min and licensed‑only training

MoonValley launched Marey for text‑to‑video and image‑to‑video, trained exclusively on licensed HD footage (no scraped or user uploads) Model overview. On the Artificial Analysis arena it ranks #12 for text‑to‑video and #21 for image‑to‑video, near Seedance 1.0 Mini and Vidu Q1 peers Prompt set Video arena. Pricing lands around $18/min both on MoonValley’s site and fal.ai per the summary Model overview. The thread also contrasts cost/quality against incumbents (e.g., Veo 3 No Audio at ~$30/min vs LTXV v0.9.7 at ~$1.20/min) to frame trade‑offs Model overview.

Seedream 4 + Kling 2.1: start/end‑frame video workflow for creators

A practical thread walks through a fast way to make viral clips: capture a real starting frame, generate an ending frame with Seedream 4 (prompt in ALT), then feed both into Kling 2.1 to animate; stitch outputs in any editor How‑to thread Kling step Editing tip. The post emphasizes minimal setup and provides prompts, positioning this as a repeatable template for short‑form social content Follow‑ups.


🛡️ Security, Safety and Governance

Safety and governance threads: California SB‑53 AI safety bill, Anthropic DC push vs China adoption, news publisher lawsuits, chip trade probes, and prompt‑injection risks for net‑enabled agents.

California SB‑53 clears Legislature: AI safety disclosures, whistleblower protections and CalCompute now await Newsom

California’s SB‑53 passed both chambers and heads to the governor. The bill compels larger AI developers to disclose safety protocols, protects whistleblowers, and launches a state‑run compute cloud (CalCompute). Reporting scales by revenue (full transparency for $500M+ firms; lighter duties for smaller labs). It now awaits Gov. Newsom’s signature or veto amid concerns about conflicts with federal/EU rules bill recap.

Anthropic pitches DC: $1 Claude for Government/Enterprise for 1 year, FedRAMP High, and national security pilots

Anthropic urged faster US government AI rollout, warning China is moving quicker on public‑sector AI. To speed adoption, it’s offering Claude for Enterprise and Claude for Government at $1 for one year via a GSA OneGov deal, citing hundreds of thousands of federal users already on Claude. It highlighted FedRAMP High posture, availability to national security users, work at Lawrence Livermore, and Pentagon CDAO pilots; other vendors are matching discounts to enable multi‑stack trials FedScoop writeup.

China opens dual probes into US chip policies and analog chip dumping ahead of new trade talks

Beijing launched two investigations: (1) alleged discrimination in US chip policy (export controls, tariffs, CHIPS Act) that blocks Chinese firms from advanced tech and markets; (2) alleged dumping of US analog chips in China at unfair prices. The timing coincides with new US‑China trade talks in Spain, escalating tech‑trade tensions that directly affect AI hardware supply chains chip probes image.

OpenAI flags usage policy update on September 17, 2025

OpenAI’s policies page shows an upcoming update dated Sep 17, 2025, directing readers to a new version. Organizations with compliance or governance controls should anticipate reviewing enforcement and scope changes once live and adjust internal guardrails accordingly policies page.

OpenAI warns Codex internet access can be prompt‑injected; recommends strict allowlists and limited HTTP methods

OpenAI’s Codex docs caution that enabling agent internet access introduces prompt‑injection, data exfiltration, malware, and licensing risks. Guidance: keep internet access off by default; when required, use narrow domain allowlists, prefer safe methods (GET/HEAD/OPTIONS), and consider preset allowlists for common dependencies to reduce blast radius internet access doc Codex internet access. This lands in context of MCP firewall (open‑source firewall to curb agent data exfiltration), extending defensive posture from tooling to configuration hardening.
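The recommended controls are easy to approximate in your own agent tooling. Below is a minimal, hypothetical egress filter in Python (the domain list and helper names are illustrative, not OpenAI's allowlist format) showing the two knobs the doc calls out: a narrow domain allowlist and read‑only HTTP methods.

```python
from urllib.parse import urlparse

# Hypothetical policy values; substitute your own vetted domains.
ALLOWED_DOMAINS = {"pypi.org", "files.pythonhosted.org", "registry.npmjs.org"}
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}  # read-only verbs, per the guidance

def is_request_allowed(method: str, url: str) -> bool:
    """Return True only for read-only requests to explicitly allowlisted hosts."""
    host = (urlparse(url).hostname or "").lower()
    domain_ok = any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
    return method.upper() in SAFE_METHODS and domain_ok

# A POST (a potential exfiltration channel) is rejected even to an allowed host.
assert is_request_allowed("GET", "https://pypi.org/simple/requests/")
assert not is_request_allowed("POST", "https://pypi.org/upload")
assert not is_request_allowed("GET", "https://attacker.example.com/steal?data=x")
```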


🗣️ Voice AI, STT/TTS and Productions

Speech tech in production: ElevenLabs STT deployment and new managed service combining AI + human editing. Mostly enterprise case studies and pricing.

ElevenLabs launches Productions, a managed dubbing/captions/transcripts/audiobooks service at $2/min

ElevenLabs rolled out Productions, an end‑to‑end, human‑in‑the‑loop service for creators and media teams covering dubbing, captions/subtitles (with SDH), transcripts and audiobooks, priced from $2.00/min with priority support and studio‑grade post‑processing Service launch. It’s already powering dubs for Dude Perfect and Andrew Huberman and is used on Hollywood film work Dubbing details. Ordering happens inside ElevenLabs accounts, and they’re recruiting a Producer network for linguists/localization pros Producer signup. Full offering and pricing are outlined in the product page ElevenLabs Productions. In context of hackathon voices where ElevenLabs amplified agent devs, this formalizes a managed pipeline for broadcast‑quality output with human QA. Capabilities by line of business: dubbing Dubbing details, captions/subtitles with accessibility tags Captions info, 99%‑accuracy transcripts backed by Scribe STT Transcripts info, and audiobook polishing Audiobooks workflow.

CARS24 transcribes 20k hours/month in 14 languages; +35% conversions, −40% disputes on ElevenLabs STT

CARS24 runs 20,000 hours/month of multilingual call audio (Hindi, English and code‑mixed dialects across 14 languages) through ElevenLabs Speech‑to‑Text on Google Cloud to flag issues in real time and generate ops insights Customer story. Reported impact: +35% first‑visit conversions, −40% evaluation disputes, +25% CSAT, and 50% faster resolutions, with accuracy claims up to 98.5% in their ambient capture setup ElevenLabs case study. The workflow spans encrypted pipelines, PII redaction, live quality monitoring and alerts, feeding hub performance dashboards Results summary.
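None of CARS24's or ElevenLabs' code is shown here; the sketch below only illustrates the described pipeline shape (transcribe, redact PII, flag for review). transcribe_chunk is a hypothetical stand‑in for whatever STT call you use (e.g., ElevenLabs Scribe), and the redaction and flagging rules are illustrative.

```python
import re

PII_PATTERNS = [
    (re.compile(r"\b\d{10}\b"), "[PHONE]"),                 # 10-digit phone numbers
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]
ESCALATION_KEYWORDS = {"refund", "complaint", "dispute", "cancel"}

def transcribe_chunk(audio_bytes: bytes) -> str:
    """Stand-in for a real STT call (e.g., ElevenLabs Scribe); returns transcript text."""
    raise NotImplementedError("plug in your STT provider here")

def redact_pii(text: str) -> str:
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

def process_call_chunk(audio_bytes: bytes) -> dict:
    """Transcribe one audio chunk, scrub PII, and flag calls that need human review."""
    transcript = redact_pii(transcribe_chunk(audio_bytes))
    flags = sorted(k for k in ESCALATION_KEYWORDS if k in transcript.lower())
    return {"transcript": transcript, "flags": flags, "needs_review": bool(flags)}
```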

ElevenLabs ships v0 Agents Starter to add voice agents to apps in minutes

ElevenLabs Devs released a v0 Agents Starter that lets teams configure an agent in ElevenLabs, clone a v0 starter template and converse with their app immediately—useful for quick PoCs and onboarding conversational features Starter intro. All starters are listed in one place with setup guidance Starters index, with an external quickstart link for the ElevenLabs v0 template Starter guide. This complements Productions and STT pipelines by lowering the lift to prototype voice‑first app flows.


🔌 Orchestration & MCP Interop

MCP connectors and standards, agent formats, and tool routing. Mostly MCP plugins, AGENTS.md standardization, and Live MCP tool‑calling evals. Excludes GPT‑5‑Codex feature items.

Codex CLI v0.36 adds MCP integration and safer internet access with approval modes

OpenAI’s Codex CLI update integrates MCP (with three approval modes), adds images and to‑dos, and tightens internet access controls. The official notes emphasize faster cloud envs and tool use at scale, with MCP as the interop backbone Release highlights. The new internet access docs warn plainly about prompt injection and data exfiltration, recommending strict allowlists and limited HTTP verbs for safety in agent runs Internet access docs. Early users hit stream errors that were patched quickly, signaling active hardening of the stack Stream error, Fix shipped. Together these changes make Codex a cleaner citizen of the MCP/tool ecosystem while reducing security foot‑guns.
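For teams new to MCP, the handshake a client like Codex performs against a configured server looks roughly like the sketch below, written against the official Python mcp SDK rather than Codex itself; the "everything" reference server is only an assumed test target.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def list_server_tools() -> None:
    # Launch a reference MCP server over stdio (assumes Node/npx is installed).
    params = StdioServerParameters(
        command="npx", args=["-y", "@modelcontextprotocol/server-everything"]
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()          # capability negotiation
            tools = await session.list_tools()  # what the agent may call
            for tool in tools.tools:
                print(f"{tool.name}: {tool.description}")

asyncio.run(list_server_tools())
```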

AGENTS.md gains traction as a cross‑agent contract; Codex prompt enforces it

The AGENTS.md pattern—machine‑readable repo instructions for coding agents—is consolidating. Community voices push to deprecate CLAUDE.md in favor of AGENTS.md across agents like Codex, Amp, Jules, Cursor, RooCode, and more Agent doc standard. OpenAI’s Codex system prompt explicitly instructs agents to discover and obey AGENTS.md, scoping precedence rules and programmatic checks for PRs Codex prompt leak. Changelog/UI updates reinforce multi‑environment behavior that benefits from a consistent agent contract Codex changelog. Result: fewer bespoke conventions and smoother tool interop across ecosystems.

LiveMCPBench launches: 70 MCP servers and 527 tools benchmarked for real-world coverage

A new wave of MCP tool‑calling benchmarks landed this week—headlined by LiveMCPBench covering 70 MCP servers and 527 tools—plus complementary suites for stability and failure modes MCP benchmarks. The emphasis shifts from prompt toys to end‑to‑end tool reliability, a key metric for agents spanning heterogeneous stacks. This arrives in context of MCP pair programming, where we saw practical MCP adoption for dev workflows, and helps teams choose robust servers and surface breakages early.

GenKit adds MCP plugin for Go/JS to connect external servers or expose your own tools

Firebase’s GenKit now ships an MCP plugin for Go and JavaScript: connect to external MCP servers, manage multiple connections, and even expose your GenKit tools as an MCP server. The docs call out using test servers (time/everything) or custom ones, turning GenKit flows into first‑class tools inside agent stacks GenKit MCP plugin. This tightens interop between app logic and shared tool buses, reducing bespoke integrations across agent frameworks.


🧬 New Models & Open Weights

Model drops beyond coding: UI‑action VLMs, image models, open‑sourced reasoning and DP training. Mostly Holo1.5, VaultGemma, HunyuanImage 2.1, ERNIE 4.5 A3B, SRPO and MobileLLM‑R1. Excludes GPT‑5‑Codex.

H Company ships Holo1.5 CU models (3B/7B/72B), 7B under Apache 2.0

H Company released Holo1.5, a family of computer‑use (CU) VLMs with +10% accuracy over Holo1 and SOTA UI localization/VQA; sizes are 3B (research‑only), 7B (Apache 2.0), and 72B (research‑only) Launch thread. Benchmarks claim it beats open baselines (Qwen‑2.5 VL), outperforms closed Sonnet 4 and tops specialized systems like UI‑TARS/Venus Benchmarks note. Models and docs are live on Hugging Face and the blog Hugging Face page H Company blog. Demos emphasize precise element localization (click X,Y) and UI VQA for reliable web/desktop/mobile navigation UI demo.

Google unveils VaultGemma 1B trained with differential privacy (ε≈2)

VaultGemma 1B is a DP‑trained LLM using DP‑SGD with Poisson sampling and new scaling laws tying compute/data/privacy to utility. Google reports a 50‑token prefix test showing no detectable memorization, with performance near GPT‑2 1.5B while preserving privacy (sequence‑level ε≈2, δ≈1.1e‑10 for 1024‑token units) Google blog post. The work details how larger batches/smaller models and tuning iterations stabilize training under DP noise Overview thread. Full blog explains the noise‑batch ratio as a key control knob for setting quality vs. privacy Scaling insights.
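VaultGemma's training code isn't shown here. As a generic illustration of the DP‑SGD mechanics the post describes (per‑example gradient clipping, calibrated noise, Poisson sampling), here is a toy setup assuming the Opacus library's make_private_with_epsilon API; the ε/δ targets mirror the reported budget and everything else is illustrative.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy model and data, purely illustrative.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
loader = DataLoader(data, batch_size=128)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=loader,          # Opacus swaps in Poisson sampling internally
    epochs=3,
    target_epsilon=2.0,          # mirrors the reported sequence-level budget
    target_delta=1e-10,
    max_grad_norm=1.0,           # per-example gradient clipping bound
)

criterion = nn.CrossEntropyLoss()
for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()          # gradients are clipped and noised by the DP optimizer
        optimizer.step()

print("spent epsilon:", privacy_engine.get_epsilon(delta=1e-10))
```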

HunyuanImage 2.1 leads open‑weights image arena; tight license, $100/1k images

Tencent’s HunyuanImage 2.1 (17B DiT, native 2048×2048, bilingual) is now the top open‑weights text‑to‑image model on the Artificial Analysis arena, edging HiDream‑I1‑Dev and Qwen‑Image Arena thread. Availability and pricing: on fal at ~$100 per 1k images; a Hugging Face demo is live Arena link. The Tencent Community License restricts >100M MAU products, bars EU/UK/South Korea use, and forbids using outputs to improve non‑Hunyuan models Arena thread. See side‑by‑side prompts and results in the arena listing Image arena.

Baidu open‑sources ERNIE‑4.5‑21B‑A3B‑Thinking; #1 trending text‑gen on HF

Baidu released ERNIE‑4.5‑21B‑A3B‑Thinking, a mixture‑of‑experts model focused on strong reasoning. As of Sept 11, it ranked #1 among trending text‑generation models on Hugging Face Trending note, with community posts highlighting the open‑sourcing as “built for serious reasoning” Announcement recap.

Tencent SRPO alignment tops HF trending; boosts FLUX realism 3×

SRPO (Semantic Relative Preference Optimization) + Direct‑Align refines diffusion outputs without over‑optimizing late steps, with text‑conditioned rewards that reduce reliance on offline reward tuning. Applied to FLUX.1.dev, human‑rated realism/aesthetics improved over 3× in reported tests Model card link. It’s now #1 trending on Hugging Face Trending snapshot. Source assets include the model page and a live Space demo for quick trials HF model HF Space.

MobileLLM‑R1 (<1B) lands on Hugging Face for edge reasoning

The sub‑1B‑parameter, edge‑focused reasoning model MobileLLM‑R1 is now on Hugging Face HF drop. This follows the in‑browser rollout noted in browser demo, where it ran 100% locally via Transformers.js. Roundup threads also list it among recent notable releases Release roundup.


✨ Agents, Dev Tooling and Coding with AI

Coding agents and dev tooling: Claude Code 1.0.115, Codex cloud internet access, DSPy/GEPA optimization, Opencode Zen, Xcode 26 assistants, AGENTS.md adoption, Factory/Amp workflows, and Codex IDE onboarding.

Claude Code 1.0.115 improves thinking UX; adds /t to toggle thinking and trims post‑tool noise

The latest Claude Code release (v1.0.115) upgrades the thinking‑mode display, adds /t to temporarily disable thinking for a single prompt, trims post‑tool output clutter, and polishes feedback cues changelog card version notes. Devs also share effective usage patterns: embed “comment directives” as mini‑prompts in code (e.g., @implement) for precise, contextual changes comment directives directive blog directive guide how‑to article, and compose tasks with subagents to keep agent responsibilities single‑purpose and auditable subagents take. Together these make it easier to balance deliberate reasoning with snappy edits inside real codebases; a hypothetical comment directive is sketched below.
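As a concrete illustration of the comment‑directive pattern, the block below embeds a mini‑prompt as an ordinary comment for the agent to act on in place; the @implement syntax is a community convention, not a Claude Code feature.

```python
import requests

# @implement: replace this stub with a retry wrapper that retries up to 3 times
# on requests.exceptions.ConnectionError with exponential backoff (1s, 2s, 4s),
# re-raising the last error. Keep the public signature unchanged.
def fetch_json(url: str) -> dict:
    return requests.get(url, timeout=10).json()
```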

Codex cloud internet access arrives with strict allowlists; beware prompt injection

Codex cloud now supports controlled internet access, but docs emphasize tight domain allowlists and method restrictions (GET/HEAD/OPTIONS) to mitigate prompt injection, exfiltration, and license risks internet access docs. Community commentary underscores the danger: a relaxed policy can be trivially abused to leak keys or sensitive data—treat networked agents as hostile by default security comment.

DSPy hits 80% in 8 GEPA iters; workshops expand the “programming, not prompting” movement

New runs show GPT‑4o reaching ~80% on an eval after just 8 GEPA iterations in DSPy, reinforcing the library’s “optimize the program, not ad‑hoc prompts” approach eval jump GEPA result. In context of DSPy momentum (stateful modules, signatures, and a growing role as a declarative agent stack), creators demo 3‑line DSPy chains on local Ollama, plus upcoming agent‑dev workshops and Ruby GEPA support local demo workshop plan Ruby GEPA, and multiple teams report adopting DSPy patterns across real projects case study. Devs describe the shift succinctly: programming instead of prompting dev quote developer comment. A minimal sketch of the local‑Ollama pattern follows below.
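For reference, the "3‑line chain on local Ollama" pattern looks roughly like the sketch below; the model name and endpoint are assumptions for a default local Ollama install, and GEPA optimization would be layered on top of a program like this.

```python
import dspy

# Point DSPy at a locally served Ollama model (model name is an assumption).
lm = dspy.LM("ollama_chat/llama3.1", api_base="http://localhost:11434", api_key="")
dspy.configure(lm=lm)

# Declare the program via a signature instead of hand-writing a prompt.
qa = dspy.ChainOfThought("question -> answer")
print(qa(question="What does GEPA optimize in a DSPy program?").answer)
```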

Opencode Zen: curated, at‑cost gateway for top coding models with stable deployments

Opencode Zen launches as a coding‑only model gateway that routes to high‑quality deployments (e.g., GPT‑5, Claude Sonnet/Opus, Qwen3 Coder 480B, Kimi K2) at cost, emphasizing reliability (no random stops or tool‑parsing errors) and easy drop‑in endpoints launch thread design goals endpoint docs. It targets consistent output across providers and can be used from any tool, with the team reporting large eval gains from deployment tuning on some models, such as Kimi K2.

Xcode 26 adds GPT‑5 and Claude Sonnet 4 integrations for in‑IDE coding assistance

Developers can now sign in with a ChatGPT account to use GPT‑5 directly in Xcode 26 Xcode with GPT‑5, while Anthropic made Claude in Xcode generally available with docs, explain‑code, previews, and playgrounds powered by Sonnet 4 Claude in Xcode. The race for “vibe coders” inside Xcode is on, with side‑by‑side experiences circulating in the community comparison card.

AGENTS.md gains traction as cross‑agent standard for repo instructions

Community momentum is building behind AGENTS.md—a simple, open format for guiding coding agents across tools (Codex, Amp, Jules, Cursor, and others)—with debates about deprecating CLAUDE.md in favor of a single convention ecosystem debate. The file scopes conventions, build/test steps and PR guidelines that agents should follow, and is increasingly referenced in agent prompts and docs AGENTS.md mention. The spec is publicly hosted for reuse and interop AGENTS.md page. Security‑minded threads in parallel flag risks of agent internet access and push for safe defaults and allowlists internet access docs.
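The precedence behavior agents are expected to follow (the AGENTS.md nearest the edited file wins) can be sketched as a simple lookup. The helper below is illustrative, not any particular agent's implementation, and assumes the nested‑file convention described in the public spec.

```python
from pathlib import Path

def applicable_agents_files(edited_file: str, repo_root: str) -> list[Path]:
    """Collect AGENTS.md files from the repo root down to the edited file's directory.

    Returned in ascending precedence: later entries (closer to the edited file)
    override earlier ones, per the nearest-file-wins convention in the spec.
    """
    root = Path(repo_root).resolve()
    directory = Path(edited_file).resolve().parent
    chain: list[Path] = []
    while True:
        candidate = directory / "AGENTS.md"
        if candidate.is_file():
            chain.append(candidate)
        if directory == root or directory == directory.parent:
            break
        directory = directory.parent
    return list(reversed(chain))  # repo-level first, nearest (highest precedence) last

# Example: instructions for packages/api/server.py would come from the repo-level
# AGENTS.md plus packages/api/AGENTS.md, with the latter taking precedence.
print(applicable_agents_files("packages/api/server.py", "."))
```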

Factory syncs CLI sessions to web; Amp demos long‑run planning then rapid edits

Factory now mirrors sessions started from the CLI into its web UI—handy for controlling droid agents from a phone while AFK session sync. Builders show Amp planning timelines in “dev days,” then implementing in minutes (useful for scoping vs. execution) planning demo. Recent Codex tests also highlight better long‑run troubleshooting with quick follow‑up diffs in the refreshed UI CLI experience, and teams are wiring dynamic OG previews directly from Amp content preview builder.

Getting started fast: Codex IDE extension with ChatGPT sign‑in, no extra cost

You can run GPT‑5‑Codex in VS Code/Cursor by installing the Codex extension and signing in with a ChatGPT account—no separate API billing needed; choose local vs cloud, agent permission levels, and model (GPT‑5 or GPT‑5‑Codex low/medium/high) from the footer install steps usage tips. Seats (Plus/Pro/Business) offer different weekly coding budgets; many projects fit within Plus limits seat limits.


✨ Benchmarks, Observability and Evals

Evals and observability: LiveBench adds GPT‑5 Pro, the UQ unsolved‑questions benchmark, reasoning token‑efficiency analysis, and Vercel Drains telemetry export.

LiveBench adds GPT‑5 Pro; High and Pro now tie at the top

LiveBench added a GPT‑5 Pro tier and shows High and Pro with nearly identical reasoning scores, topping the leaderboard leaderboard update. This refines yesterday’s picture, when GPT‑5 High took the lead in LiveBench (initial update). The new entry suggests tier parity on LiveBench’s hardest tasks; watch for future deltas as routing/think time settings evolve leaderboard update.

UQ benchmark: only ~15% of real unsolved questions pass automated validation

The UQ benchmark curates 500 unanswered Stack Exchange questions (LLM‑screened for clarity/difficulty) and evaluates models as validators (not generators). Early results: only ~15% of proposed answers pass automated checks, exposing headroom vs contrived exams benchmark overview paper link. The design stresses well‑definedness, approachability, and objective verifiability, and runs continuous, public validation flows to credit genuine solutions over time design goals methods recap.

Study: token efficiency varies 1.5–4× across reasoning models; open weights often spend more

NousResearch’s analysis of “thinking efficiency” finds large spread in chain‑of‑thought token usage: open‑weight families frequently consume 1.5–4× more tokens than closed models for similar accuracy, with outliers 10× on simple knowledge queries token efficiency post. The authors argue token parsimony is a critical metric alongside quality/latency. Community commentary echoes this shift as routing and variable “thinking effort” become competitive levers in production agents efficiency comment.

Vercel Drains ships unified export for traces, logs, analytics and Speed Insights

Vercel introduced Drains to stream OpenTelemetry traces, logs, Web Analytics events, and Speed Insights to custom endpoints or partners (Datadog, Honeycomb, Grafana, etc.) with sampling, JSON/NDJSON/Protobuf formats, and security headers product blog Vercel blog post. The pipeline correlates browser‑to‑server signals for precise debugging and performance analysis, reducing the need for ad‑hoc wiring in AI apps where latency spikes or tool‑call failures must be traced end‑to‑end.


✨ Business, Funding and Enterprise

Business and adoption: OpenAI/Anthropic usage studies, $1 Claude for government, Gemini’s iOS download lead, CARS24’s voice AI results, Micro1’s raise, Xcode 26 integrations, Comfy Cloud beta, and ChatGPT commerce signals.

OpenAI + Anthropic usage studies: non‑work >70%, business automation 77%

Two first‑party reports land the same morning. OpenAI’s working paper (700M weekly users; ~18B msgs/week) shows non‑work usage rising from 53% (Jun ’24) to >70% (Jun ’25); top intents are Practical Guidance, Seeking Information, and Writing; coding is ~4.2% of consumer chats key takeaways, OpenAI paper. Anthropic’s Economic Index finds usage per capita strongly correlates with income; directive automation in Claude grew from 27%→39% (Dec ’24–Aug ’25); and first‑party API customers automate 77% of business tasks (vs ~50% for Claude.ai users) index highlights, Anthropic report. Analysts also note faster growth in low‑/mid‑income countries for ChatGPT and uneven geographic enterprise adoption for Claude both reports.

Anthropic offers Claude for Government/Enterprise at $1 for 1 year

To accelerate US government adoption, Anthropic is pitching Claude for Government and Enterprise for $1 for one year via GSA OneGov, paired with FedRAMP High posture and pilots at LLNL and DoD CDAO (OTA) DC pitch. The company frames speed as a national‑competitiveness issue versus rapid Chinese AI rollout across services and industry, and says other vendors are matching discounts so agencies can test stacks in parallel DC pitch.

Gemini edges past ChatGPT in US iOS downloads; Nano Banana ad tops 20M views

Gemini is now the most‑downloaded free iOS app in the US, edging past ChatGPT app rankings, with Google’s “Nano Banana” creative pushing 20M+ views in paid/social distribution ad views. Demis Hassabis congratulated the Gemini app team, calling it “just the start” Hassabis note, as Alphabet crossed a $3T market cap—market signals of momentum and spend behind Gemini’s consumer funnel $3T milestone. This comes in context of App Store #1 where Gemini first hit #1 while MAUs still trailed ChatGPT.

CARS24’s voice AI: +35% conversions, −40% disputes using ElevenLabs STT

India’s CARS24 processes ~20,000 hours/month of multilingual customer conversations (14+ languages, high code‑mix) with ElevenLabs Speech‑to‑Text on Google Cloud, turning calls into real‑time insights. Reported outcomes: +35% first‑visit conversions, −40% evaluation disputes, +25% CSAT, with 98.5% transcription accuracy and 50% faster resolutions deployment thread, results, case study. Privacy measures include encrypted pipelines and PII redaction case study.

Micro1 raises $35M at $500M valuation; ARR jumps to $50M

Data‑labeling startup Micro1 closed a $35M Series A led by 01 Advisors at a ~$500M valuation, added ex‑Twitter COO Adam Bain to its board, and says ARR climbed from $7M to $50M in 2025 as it supplies domain‑expert annotators vetted by its AI recruiter ‘Zara’ funding note. The company positions against Scale AI’s fallout and rivals like Surge/Mercor, and is prepping “environment” tools for training agents—signals of continued spend on high‑skill labeling as labs push agentic workflows funding note.

Xcode 26 ships with GPT‑5 and Claude Sonnet 4 integrations for coding

Apple’s Xcode 26 now supports ChatGPT accounts with GPT‑5 built‑in for code assistance GPT‑5 in Xcode, while Anthropic announced general availability of Claude Sonnet 4 in Xcode for documentation, code explain, previews/playgrounds Claude in Xcode, Anthropic blog. The twin integrations signal deeper IDE‑native adoption paths across the iOS/macOS developer base.

ComfyUI unveils Comfy Cloud private beta: full Comfy in your browser

ComfyUI announced Comfy Cloud—“the full power of ComfyUI, now in your browser,” with no installs—opening a waitlist for creators and studios that already use Comfy at scale beta signup, Comfy Cloud site. A follow‑up tease hints at another announcement in ~12 hours, underscoring product velocity and demand for browser‑native creative stacks launch tease, waitlist page.

OpenAI tests in‑app Orders tab for ChatGPT, hinting native commerce

A hidden “Orders” section appeared in ChatGPT desktop settings, showing purchase history and implying upcoming checkout/order tracking inside ChatGPT—an early signal of native commerce monetization beyond subscriptions Orders tab. If rolled out, it would widen OpenAI’s revenue surface beyond API and plans by embedding transactions in assistant workflows.


✨ Data, Retrieval and GraphRAG

Retrieval and data pipelines: Youtu‑GraphRAG, DocWrangler’s semantic ETL IDE, a production RAG course, AI SDK context compression, and HANRAG multi‑hop retrieval.

Youtu‑GraphRAG fuses schema‑guided graph building with agentic retrieval, cutting tokens and lifting accuracy

Tencent Youtu‑GraphRAG reports up to 90.71% fewer tokens and 16.62% higher accuracy versus strong baselines by unifying a schema‑driven extraction pipeline (triples + community grouping into a 4‑level knowledge tree) with a retrieval agent that decomposes queries into schema‑valid sub‑queries and reflects until consistent answers are found Paper thread. Code/dataset links are referenced alongside the arXiv preprint; the approach emphasizes fewer wasted hops and tighter reasoning in complex multi‑hop questions Figure summary. Positioned as a continuation of deterministic pipelines seen in ParserGPT one‑time site adapters, but here the schema governs both build and query to constrain noise while improving answerability Paper thread.

DocWrangler earns UIST recognition for a mixed‑initiative semantic ETL IDE used 1,500+ times

Berkeley’s DocWrangler, a mixed‑initiative IDE for semantic data processing, received a Best Paper Honorable Mention and documents how analysts cross three gulfs: comprehension (make sense of messy docs), specification (express exact intent), and generalization (pipelines that hold beyond a small set) Award note, Three gulfs. The team shipped features like in‑situ notes, model‑assisted prompt refinement, and operation decomposition; an online deployment logged 1,500+ uses revealing real pipelines people author Deployment stats. The thread walks through why writing/iterating the pipeline clarifies user intent as much as it instructs the model Intent clarifies, with links to try the tool and the paper Links.

A practitioner course distills production RAG: metrics that predict, rerankers, and FT’d embeddings

A hands‑on RAG course (closing soon) from an engineer who built search systems generating $50M+ revenue outlines a concrete recipe: synth data mirroring real queries, precision/recall metrics that actually predict prod behavior, reranking pipelines for 10–20% instant gains, embedding fine‑tuning on your data, and index specialization/segmentation Course outline. It includes credits (Chroma, Cohere, Modal) and live materials; enroll details and audience (engineers from OpenAI/Anthropic/Google, tech leads at Salesforce/Adobe/Shopify) are provided Enroll details. This is a rare, production‑first consolidation of techniques teams can apply immediately to lift retrieval quality.

Context compression patterns in Vercel AI SDK reduce prompt bloat in agent tool loops

Developers highlighted using AI SDK’s prepareStep and toModelOutput to programmatically compress prior messages while retaining full UI history—useful for long tool‑calling loops where context windows are stressed Code snippet. With AI SDK 5 abstracting chat logic and enabling useChat‑style hooks across frameworks, it’s straightforward to standardize compression filters that preserve salient state while trimming the rest SDK 5 changes. This is a pragmatic building block for scalable RAG/agent pipelines that frequently re‑enter tools without exploding token budgets Code snippet.

HANRAG debuts as a heuristic, noise‑resistant multi‑hop RAG framework on Hugging Face

Ant Group introduced HANRAG on Hugging Face, a heuristic RAG framework aimed at efficient, noise‑resistant multi‑hop retrieval over complex corpora HF announcement. While details are brief in the feed, the release positions HANRAG for scenarios where classic chunk‑and‑fetch pipelines struggle with distractors and long dependency chains, offering a practical alternative for production‑grade multi‑hop QA HF announcement.


✨ Hardware Accelerators

Accelerator news: vLLM aarch64 support for GB200, Together’s GB300 burn‑in, and Fireworks’ GPU‑vs‑ASIC inference claim.

vLLM 0.10.2 adds official aarch64; runs natively on NVIDIA GB200

vLLM shipped 0.10.2 with first‑class aarch64 support, so you can install directly on NVIDIA’s GB200; the Docker image is now multi‑platform as well, easing ARM server rollouts and mixed‑arch fleets aarch64 support. This reduces custom builds and aligns with emerging GB200 deployments for high‑throughput serving.
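Once the aarch64 wheel is installed (pip install vllm), the offline‑inference API is unchanged from x86. A minimal sketch, with the model choice as an assumption:

```python
from vllm import LLM, SamplingParams

# Model choice is illustrative; any HF-hosted model vLLM supports will do.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize why multi-arch wheels simplify GB200 rollouts."], params)
print(outputs[0].outputs[0].text)
```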

Together brings GB300 racks online as burn‑in begins

Together Compute started GB300 burn‑in, signaling new capacity coming online for large‑scale training/inference rack photo. For platform teams, this is a near‑term indicator of more high‑bandwidth cluster supply entering the market (useful for long‑context and parallel reasoning workloads).

Fireworks claims GPU inference surpasses an ASIC on GPT‑OSS workload

Fireworks reported GPU inference throughput beating an ASIC provider on an AA‑run GPT‑OSS benchmark—the first time they say a GPU crossed an ASIC in their tests benchmark claim. If reproducible, this narrows the accelerator trade‑space and favors GPU‑first runtime stacks (faster iteration, wider tooling) for certain model profiles.


✨ Infrastructure

Serving and platform infrastructure: OpenAI rate‑limit increases, Codex Cloud container caching, Vercel Drains, Comfy Cloud, PrimeIntellect reserved clusters, GB200/GB300 capacity, and Baseten’s Qwen3‑Next deployment.

OpenAI lifts API throughput caps: GPT‑5 to 4M TPM, GPT‑5‑mini starts at 500k TPM

OpenAI quietly raised API rate limits, pushing GPT‑5’s top tier to 4 million tokens/minute and bumping GPT‑5‑mini’s entry tier to 500k TPM, expanding headroom for high‑throughput agent workloads and batch processing Throughput update.
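Higher caps still reward client‑side pacing. Below is a minimal sketch of wrapping the Responses API with jittered backoff on rate‑limit errors; the tenacity retry policy is an illustrative choice, not an OpenAI recommendation.

```python
from openai import OpenAI, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_random_exponential(min=1, max=60),   # jittered exponential backoff
    stop=stop_after_attempt(6),
)
def ask(prompt: str) -> str:
    response = client.responses.create(model="gpt-5", input=prompt)
    return response.output_text

print(ask("One sentence on why TPM headroom matters for batch agent workloads."))
```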

Codex Cloud’s cached containers cut median completion time ~90% on remote tasks

Beyond the GPT‑5‑Codex model, OpenAI’s platform updates matter for infra: cloud agents now auto‑set environments and use container caching to slash median completion time by ~90% (faster cold starts, fewer dependency stalls). Devs report snappier delegations and smoother local↔cloud handoffs Feature roundup Model overview OpenAI blog post.

Vercel launches Drains to stream logs, traces, analytics to any destination

Vercel Drains unifies export of OpenTelemetry traces, Web Analytics, Speed Insights and logs—either to turnkey integrations (Datadog, Honeycomb, Grafana, etc.) or custom HTTP drains. Useful knobs include sampling, Protobuf/NDJSON formats, and security headers; cross‑signal correlation ties browser LCP spikes to backend traces for faster RCA Vercel blog Introducing Vercel Drains.

ComfyUI announces Comfy Cloud private beta: full Comfy in the browser

Comfy Cloud brings the popular node‑based gen‑media workflow to a managed browser experience—no install, instant start—aimed at individuals and studios that already prototype locally but want elastic infra for heavier runs and sharing Private beta Comfy Cloud site. ‘More tomorrow’ hints at rapid feature cadence Tease.

Developers highlight Codex cloud env automation and cache: fewer stalls, faster PR checks

Multiple early users call out the infra wins: cloud containers pre‑warm and auto‑install missing deps, significantly reducing idle time and failed setup; PR review agents attach deep, code‑executed findings with fewer low‑value comments—improving CI cycles and review latency Developer verdict OpenAI digest.

PrimeIntellect debuts Reserved Instances for 8–1,000+ GPU clusters with multi‑vendor quotes

A new marketplace flow lets teams request clusters from 8 up to 1,000+ GPUs and receive quotes from 50+ providers within 24 hours—smoothing capacity planning for training and inference bursts, and spreading risk across vendors Reserved instances.

Together’s GB300 cluster begins burn‑in, following earlier NVL72 rack rollout

TogetherCompute showed GB300 hardware in burn‑in—another signal of fast‑expanding capacity for large‑scale training/inference. This advances the ramp noted in GB200 NVL72 racks where an NVL72 rack push was highlighted Rack photo.

vLLM 0.10.2 ships official aarch64 support; install directly on NVIDIA GB200

vLLM v0.10.2 can now be installed on aarch64, with a multi‑platform Docker image that pulls the right arch automatically. Teams running GB200 racks get a more straightforward path to deploy high‑throughput inference with vLLM APIs vLLM release.

Baseten adds Qwen3‑Next‑80B‑A3B Thinking for managed inference on NVIDIA stack

Baseten published a ready‑to‑deploy Qwen3‑Next‑80B‑A3B (Thinking) image in its model library, touting strong benchmark parity vs larger closed models and straightforward spin‑up on the Baseten inference stack running on NVIDIA GPUs Model library Baseten model page.


✨ Training, Optimizers and Reasoning

Training and reasoning research: K2‑Think, AgentGym‑RL, CoT‑Space, long‑horizon compounding, DeepDive search agents, Gensyn’s SAPO, and Meta’s outcome‑based exploration.

K2‑Think: 32B reasoning system rivals 120B baselines with Cerebras‑accelerated throughput

Built on Qwen2.5‑32B, K2‑Think combines long CoT SFT, RL on auto‑verifiable tasks, plan‑then‑solve, and generate‑3‑then‑select. It matches or beats much larger models on math and stays strong on code/science while serving ~2,000 tok/s on Cerebras WSE overview.

AgentGym‑RL: staged multi‑turn RL trains long‑horizon agents

ByteDance’s AgentGym‑RL splits environments, agent loop, and training to stabilize multi‑turn RL for long tasks. ScalingInter‑RL starts short, then gradually extends horizons, preventing loops and encouraging planning/recovery. A 7B model trained across 27 tasks competes with larger proprietary systems; compute at train/test time often beats adding parameters paper thread.

CoT‑Space theory: optimal chain length and step‑level learning

A new framework models reasoning at the step level (not token‑by‑token), predicting a U‑shaped error curve with an optimal chain length that balances under‑ and over‑thinking. It argues dense “meaning space” paths make step‑level RL more faithful to real reasoning, and explains why very long chains overfit while very short ones underfit paper thread. The authors show harder datasets push optimal chain length up and larger models prefer shorter chains to avoid overfitting experiments. Full details in ArXiv paper.

Small accuracy gains compound to much longer tasks in new long‑horizon study

New evidence shows modest per‑step accuracy improvements can exponentially extend the length of tasks LLMs can complete; failures often stem from execution, not reasoning. Larger models sustain longer horizons even when smaller ones ace single steps paper thread, paper link. This extends prior long‑horizon coverage long horizon by quantifying compounding effects and highlighting self‑conditioning risks. Full analysis in ArXiv paper.
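The compounding argument is easy to see with a back‑of‑envelope model: if steps succeed independently with probability p, an H‑step task completes with probability p^H, so the horizon solvable at a fixed success threshold grows like log(threshold)/log(p). The toy calculation below uses that simplification, which is ours, not the paper's exact model.

```python
import math

def max_horizon(step_accuracy: float, success_threshold: float = 0.5) -> int:
    """Longest task (in steps) completed with >= success_threshold probability,
    assuming independent per-step success (a deliberate simplification)."""
    return math.floor(math.log(success_threshold) / math.log(step_accuracy))

for p in (0.99, 0.995, 0.999):
    print(f"per-step accuracy {p:.3f} -> ~{max_horizon(p)} steps at 50% task success")
# 0.990 -> ~68 steps, 0.995 -> ~138 steps, 0.999 -> ~692 steps
```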

DeepDive: multi‑turn RL + hard data lifts deep‑search agents on web benchmarks

THUDM synthesizes hard, multi‑hop questions from knowledge graphs and trains end‑to‑end with strict reward gating (format + exact answer). A 32B agent hits 14.8% on BrowseComp, outperforming open agents like WebSailor/DeepSeek‑R1‑Browse; test‑time scaling (parallel tool calls + fewest‑calls selection) nearly doubles accuracy method TL;DR, results, scaling. See ArXiv paper and GitHub repo.

Gensyn’s SAPO: decentralized RL with rollout sharing boosts learning by up to 94%

SAPO trains across a heterogeneous “swarm,” sharing plain‑text rollouts among nodes (from laptops to servers). Nodes mix local and external experiences, propagate “aha” trajectories, and update via PPO—yielding up to 94% higher cumulative rewards vs. no sharing method explainer, results. Paper and code notes in ArXiv paper.

Meta’s outcome‑based exploration keeps accuracy while preserving answer diversity

FAIR shows standard RL that rewards only final answers boosts accuracy but collapses answer diversity, hurting test‑time sampling. Two fixes—historical and batch outcome‑based exploration—reward rare outcomes or penalize repeats to retain variety without sacrificing accuracy, validated on Llama and Qwen math tasks paper overview.

On this page

Executive Summary
🦿 Robotics and Embodied AI
OpenAI ramps robotics hiring with teleoperation and Isaac Sim focus
NavFoM unifies navigation across drones, quadrupeds and vehicles with 8M‑sample training
SUSTech “violence test” shows fall‑robust humanoid control under hard perturbations
Figure teases three days of announcements amid F.03 and Helix AI buzz
Tencent teases new Hunyuan 3D model after ultra‑long world model reveal
Unitree’s precision demos trend as company prepares to go public
🎨 Gen Media, Vision and Creative Stacks
Higgsfield makes Soul image model free for everyone
Google pushes Nano Banana; Gemini tops iOS downloads and a Nano Banana ad passes 20M views
MoonValley debuts Marey video model; ~$18/min and licensed‑only training
Seedream 4 + Kling 2.1: start/end‑frame video workflow for creators
🛡️ Security, Safety and Governance
California SB‑53 clears Legislature: AI safety disclosures, whistleblower protections and CalCompute now await Newsom
Anthropic pitches DC: $1 Claude for Government/Enterprise for 1 year, FedRAMP High, and national security pilots
China opens dual probes into US chip policies and analog chip dumping ahead of new trade talks
OpenAI flags usage policy update on September 17, 2025
OpenAI warns Codex internet access can be prompt‑injected; recommends strict allowlists and limited HTTP methods
🗣️ Voice AI, STT/TTS and Productions
ElevenLabs launches Productions, a managed dubbing/captions/transcripts/audiobooks service at $2/min
CARS24 transcribes 20k hours/month in 14 languages; +35% conversions, −40% disputes on ElevenLabs STT
ElevenLabs ships v0 Agents Starter to add voice agents to apps in minutes
🔌 Orchestration & MCP Interop
Codex CLI v0.36 adds MCP integration and safer internet access with approval modes
AGENTS.md gains traction as a cross‑agent contract; Codex prompt enforces it
LiveMCPBench launches: 70 MCP servers and 527 tools benchmarked for real-world coverage
GenKit adds MCP plugin for Go/JS to connect external servers or expose your own tools
🧬 New Models & Open Weights
H Company ships Holo1.5 CU models (3B/7B/72B), 7B under Apache 2.0
Google unveils VaultGemma 1B trained with differential privacy (ε≈2)
HunyuanImage 2.1 leads open‑weights image arena; tight license, $100/1k images
Baidu open‑sources ERNIE‑4.5‑21B‑A3B‑Thinking; #1 trending text‑gen on HF
Tencent SRPO alignment tops HF trending; boosts FLUX realism 3×
MobileLLM‑R1 (<1B) lands on Hugging Face for edge reasoning
✨ Agents, Dev Tooling and Coding with AI
Claude Code 1.0.115 improves thinking UX; adds /t to toggle thinking and trims post‑tool noise
Codex cloud internet access arrives with strict allowlists; beware prompt injection
DSPy hits 80% in 8 GEPA iters; workshops expand the “programming, not prompting” movement
Opencode Zen: curated, at‑cost gateway for top coding models with stable deployments
Xcode 26 adds GPT‑5 and Claude Sonnet 4 integrations for in‑IDE coding assistance
AGENTS.md gains traction as cross‑agent standard for repo instructions
Factory syncs CLI sessions to web; Amp demos long‑run planning then rapid edits
Getting started fast: Codex IDE extension with ChatGPT sign‑in, no extra cost
✨ Benchmarks, Observability and Evals
LiveBench adds GPT‑5 Pro; High and Pro now tie at the top
UQ benchmark: only ~15% of real unsolved questions pass automated validation
Study: token efficiency varies 1.5–4× across reasoning models; open weights often spend more
Vercel Drains ships unified export for traces, logs, analytics and Speed Insights
✨ Business, Funding and Enterprise
OpenAI + Anthropic usage studies: non‑work >70%, business automation 77%
Anthropic offers Claude for Government/Enterprise at $1 for 1 year
Gemini edges past ChatGPT in US iOS downloads; Nano Banana ad tops 20M views
CARS24’s voice AI: +35% conversions, −40% disputes using ElevenLabs STT
Micro1 raises $35M at $500M valuation; ARR jumps to $50M
Xcode 26 ships with GPT‑5 and Claude Sonnet 4 integrations for coding
ComfyUI unveils Comfy Cloud private beta: full Comfy in your browser
OpenAI tests in‑app Orders tab for ChatGPT, hinting native commerce
✨ Data, Retrieval and GraphRAG
Youtu‑GraphRAG fuses schema‑guided graph building with agentic retrieval, cutting tokens and lifting accuracy
DocWrangler earns UIST recognition for a mixed‑initiative semantic ETL IDE used 1,500+ times
A practitioner course distills production RAG: metrics that predict, rerankers, and FT’d embeddings
Context compression patterns in Vercel AI SDK reduce prompt bloat in agent tool loops
HANRAG debuts as a heuristic, noise‑resistant multi‑hop RAG framework on Hugging Face
✨ Hardware Accelerators
vLLM 0.10.2 adds official aarch64; runs natively on NVIDIA GB200
Together brings GB300 racks online as burn‑in begins
Fireworks claims GPU inference surpasses an ASIC on GPT‑OSS workload
✨ Infrastructure
OpenAI lifts API throughput caps: GPT‑5 to 4M TPM, GPT‑5‑mini starts at 500k TPM
Codex Cloud’s cached containers cut median completion time ~90% on remote tasks
Vercel launches Drains to stream logs, traces, analytics to any destination
ComfyUI announces Comfy Cloud private beta: full Comfy in the browser
Developers highlight Codex cloud env automation and cache: fewer stalls, faster PR checks
PrimeIntellect debuts Reserved Instances for 8–1,000+ GPU clusters with multi‑vendor quotes
Together’s GB300 cluster begins burn‑in, following earlier NVL72 rack rollout
vLLM 0.10.2 ships official aarch64 support; install directly on NVIDIA GB200
Baseten adds Qwen3‑Next‑80B‑A3B Thinking for managed inference on NVIDIA stack
✨ Training, Optimizers and Reasoning
K2‑Think: 32B reasoning system rivals 120B baselines with Cerebras‑accelerated throughput
AgentGym‑RL: staged multi‑turn RL trains long‑horizon agents
CoT‑Space theory: optimal chain length and step‑level learning
Small accuracy gains compound to much longer tasks in new long‑horizon study
DeepDive: multi‑turn RL + hard data lifts deep‑search agents on web benchmarks
Gensyn’s SAPO: decentralized RL with rollout sharing boosts learning by up to 94%
Meta’s outcome‑based exploration keeps accuracy while preserving answer diversity