OpenAI GPT‑5‑Codex – 2× slowdown fixed; limits reset after GPU surge
Executive Summary
OpenAI’s GPT‑5‑Codex buckled under demand, running about 2× slower than targets before the team added GPUs and restored speed. Limits are reset to compensate, and a safety addendum details code‑specific guardrails, including 100% malware refusals on evals. The launch’s headline capability remains: 7+ hour autonomous coding runs spanning IDEs, CLIs, the web, and GitHub.
In numbers:
- 2× slower than targets; added GPUs restore latency to nominal levels
- Limits reset so users can run more Codex jobs today
- Safety: 100% malware refusals on evals; agent sandboxing; network disabled by default
- VaultGemma: 1B parameters; ε≤2.0 differential privacy; open weights and code
- Hunyuan3D 3.0: 1536³ geometry; 3.6B voxels; 20 free generations via engine/API
- HuMo: 17B and 1.7B video models; Apache 2.0 license; mask‑guided lip‑sync
- VoxCPM: 0.5B parameters; trained on 1.8M+ hours; zero‑shot voice cloning
Also:
- Hunyuan‑MT/Chimera rank first on 30 of 31 WMT2025 language pairs
- Ring‑mini‑2.0 MoE: 16B total parameters; 1.4B active parameters
- Gemini test models “oceanstone/oceanreef” spotted; prompts note September 2025 cutoff
🗣️ Voice & Real‑time Apps
Few but notable: VoxCPM tokenizer‑free TTS with zero‑shot cloning; Monologue Mac app for context‑aware dictation with local models; an ElevenLabs voice‑agent starter kit plus a Cloudflare AI Avenue appearance.
OpenBMB launches VoxCPM 0.5B tokenizer‑free TTS with zero‑shot voice cloning
OpenBMB unveiled VoxCPM 0.5B, a tokenizer‑free text‑to‑speech model that delivers context‑aware prosody and zero‑shot voice cloning in a compact footprint, trained on 1.8M+ hours of audio. The design avoids discrete acoustic tokens, aiming for more natural, expressive speech at small model sizes suitable for apps and devices. Model announcement
- Zero‑shot voice cloning and context‑aware speech generation highlighted as core capabilities, with a 0.5B‑parameter footprint for easier deployment Model announcement
- Try the live demo and assets via Hugging Face; code/resources are linked for quick experiments Hugging Face demo
- Early community demos show quick TTS app prototyping around the model in Anycoder environments (developer showcase) Anycoder demo
Every’s Monologue debuts: context‑aware Mac dictation with local‑first options
Every launched Monologue, a Mac app that turns speech into structured, polished text tailored to the app you’re in, with a personal dictionary, multilingual support, deep on‑screen context (with permission), and on‑device model options. Early adopters report heavy weekly usage and significant speedups vs typing. Launch thread, Website promo
- Features include smart formatting for email/docs/notes/code, automatic proper‑noun and acronym handling, and customizable workflows; free for Every subscribers ($30/mo) or $10/mo standalone early‑bird Monologue site
- Team cites “over 1M words/week” written by early users and strong stickiness across creators and developers Launch thread
- Community feedback highlights high context accuracy and steady product iteration, including usage stats like 95k words/30 days from power users Usage stat card
ElevenLabs ships v0 Agents Starter to spin up voice agents quickly
ElevenLabs introduced a starter flow for building voice‑enabled agents—configure the agent in ElevenLabs, clone the v0 template, and talk to it—aimed at accelerating real‑time voice agent prototyping and integration, in context of Productions launch (managed dubbing/captions rollout). Starter kit
- The starter prescribes a minimal setup path (config → template clone → live voice) to reduce time‑to‑pilot for agentic apps Starter kit
- The team also appeared on Cloudflare’s AI Avenue discussing how to create authentic‑sounding voices and productionize voice experiences Show teaser, and AI Avenue site
- Community events continue (e.g., an AI in Advertising hackathon with Twelve Labs), signaling momentum around voice agent use cases Hackathon note
🤖 Robotics & Embodied AI
OpenAI restarts robotics (humanoid/teleop/Isaac Sim hiring), Unitree open-sources a world‑model+action stack, Figure raises $1B for humanoids; thread highlights simulation, data collection, and scaling for real deployments.
Figure raises $1B at $39B to scale humanoid robots into production
Figure closed a $1B Series C at a $39B post-money valuation to ramp manufacturing and real‑world deployments of its humanoid robots. The round adds strategic depth across chips, carriers, and industrials, signaling a push from R&D to scaled ops.
- Lead and syndicate include Parkway Venture Capital, NVIDIA, Intel Capital, LG Technology Ventures, Salesforce, T‑Mobile Ventures, Qualcomm Ventures, Brookfield Asset Management, and Macquarie Capital funding note, investor list
- Proceeds target factory scaling (BotQ), field deployments in commercial and household settings, and expanded NVIDIA GPU infrastructure for model training/simulation use of funds
- Figure highlights data collection to improve perception across video and multimodal sensors, aligning with a deployment‑driven learning loop use of funds
Unitree open-sources UnifoLM‑WMA‑0 world‑model×action stack on Hugging Face
Unitree released UnifoLM‑WMA‑0, a world‑model plus action architecture spanning multiple robotic embodiments, designed for general‑purpose robot learning. The world model doubles as an interactive simulator for synthetic data and a policy enhancer for long‑horizon control.
- Architecture: a unified world model that predicts robot‑environment interactions, supports simulation for data generation, and boosts policy decision‑making via action heads model overview
- Target: multi‑embodiment learning across arms and humanoids, emphasizing transfer and scaling toward "general‑purpose" robotics model overview
- Availability: open weights and assets are hosted on Hugging Face to accelerate community R&D and reproducibility model overview
OpenAI restarts robotics push with humanoids, teleop and Isaac Sim hires
OpenAI is rebuilding a robotics division with roles focused on humanoid control, teleoperation, high‑volume hardware prototyping, and Nvidia Isaac simulation—adding named talent and clearer signals it will pursue "general‑purpose robotics" in the physical world. This extends the hiring burst first noted in hiring focus (teleop/Isaac Sim push).
- New datapoints include an experienced hire (e.g., Chengshu Li, June 2025) and job specs calling for simulation, tactile sensor design, and manufacturing experience at 1M+ scale role specs
- Listings emphasize perception‑to‑control training stacks (Isaac Sim), teleop for data collection, and safety‑minded hardware/software iteration loops hiring details
- After pausing robotics in 2021, OpenAI now frames robotics as necessary for AGI by coupling high‑rate perception with robust control; build vs partner vs off‑the‑shelf remains unspecified hiring details, role specs
🎥 Generative Media & 3D
Strong creative wave: Seedream 4 High‑Res challenges Nano Banana; Hunyuan3D 3.0, World Labs persistent 3D worlds, Krea realtime video, Genie 3 demos, SRPO realism; HuMo multimodal video and Weavy/Kling workflows; Perplexity adds Nano Banana/Seedream models.
ByteDance + Tsinghua release HuMo: multimodal human‑centric video (text+image+audio), Apache‑2.0
HuMo lands in two sizes (17B and 1.7B) with subject‑consistent generation and mask‑guided lip‑sync, supporting text, image, and audio conditioning—under Apache‑2.0 for practical use. overview thread, architecture note, lipsync mask
- Project page, weights, and paper are public: ArXiv paper, Hugging Face repo, project page
- Design choices: inject reference appearance minimally to preserve base model fidelity; audio cross‑attention drives natural lip movements without freezing global motion architecture note lipsync mask
- Positioning: a unified alternative to vid2vid pipelines; pairs well with creative edit loops and dataset‑constrained production setups comparison note
Seedream 4 High Res ties Nano Banana atop LM Arena text-to-image; debuts #2 on edits
3,700+ new votes swung LM Arena’s image boards: Seedream 4 High Res (ByteDance) is now tied for #1 with Nano Banana (Gemini 2.5 Flash Image) in text-to-image, and ranks #2 in image editing. Early but meaningful signal for creative toolchains. leaderboard update, edit ranking, battle link
- See the live boards and model matchups at LMArena
- The update also separates High Res (#2) vs standard Seedream 4 (#3) on edits, implying quality gains beyond base SD4 variants edit ranking
Hunyuan3D 3.0 launches with 1536³ geometry, 3× precision; free 20 gens and Tencent Cloud API
Following up on tease, Tencent ships Hunyuan3D 3.0 with 3× higher precision, 1536³ geometric resolution, and 3.6B‑voxel ultra‑HD modeling. Free access includes 20 generations, with a livestream walkthrough scheduled. launch details, livestream time, teaser
- Highlights: lifelike faces, faithful structure reconstruction (layered strategy), higher texture fidelity; available via Tencent Cloud API launch details
- Livestream: Sept 17, 12:00 PM UTC for modeling demos and pipeline tips livestream time
World Labs shows persistent, navigable 3D worlds from text/images with Gaussian‑splat export
Instead of short morphing scenes, these worlds persist—users can navigate indefinitely without drift. The model builds explorable 3D from a prompt or image, exportable as Gaussian splats for real‑time rendering (e.g., Three.js Spark). update thread, blog mention
- Splats deliver dense, browser‑friendly scenes with consistent geometry beyond the original view frustum update thread
- Marble beta hosts thousands of worlds and creator access; the team calls out zero runtime cost for continued navigation update thread
YouTube Shorts adds Veo 3 Fast video generation with sound; Edit with AI and Speech‑to‑Song coming
YouTube is moving generative video into the mainstream creative stack. A custom Veo 3 Fast now generates sound‑on clips inside Shorts (US, CA, UK, AU, NZ), with Edit with AI and Speech‑to‑Song (Lyria 2) queued for later this year. Shorts rollout, edit with AI, speech to song, Google blog post
- Add Motion animates photos by transferring motion across subjects; Stylize and Add Objects broaden looks and compositing motion feature
- Edit with AI assembles first cuts (music, transitions, VO) for speed; Speech‑to‑Song remixes spoken lines into musical hooks edit with AI, speech to song
Krea’s Realtime Video generates controllable, physics‑aware infinite clips from sketches
Krea AI’s realtime engine now turns rough sketches (plus webcam or screen inputs) into infinite‑length videos with basic physics simulation—useful for prototyping motion, UI flows, or quick previz loops. feature note
- Inputs: sketch, webcam, screen capture; outputs: continuous clips with steering via strokes and overlays feature note
- Practical angle: rapid iteration without a full 3D pipeline; pairs well with post‑edits in NLEs for polish
Perplexity adds Nano Banana and Seedream 4.0 as image generation backends
Perplexity Pro now lets users pick Nano Banana (Gemini 2.5 Flash Image) and Seedream 4.0 for image generation—alongside GPT Image 1, FLUX.1, and DALL·E 3—bringing top arena models into a mainstream assistant. settings screen
- Signals continued consolidation: end‑user apps exposing multiple premium T2I models under one UI settings screen
Genie 3 demos spur interest as Google team details scale, benchmarks, and agent links
Beyond clip generation, Genie’s pitch centers on agentic worlds and detailed control. Demos include live ancient‑Athens‑style scenes; a developer interview covers capabilities, attention to detail, and alignment with broader agent efforts. ancient Athens demo, interview rundown
- Topics span training/testing, event prompting, ties to Veo 3/Nano Banana, and hardware considerations interview rundown
- Creative takeaway: stronger grounding for interactive worlds, not just passive video; positions well for game/UX prototyping
🧩 Interoperability & MCP
MCP momentum: GitHub launches an MCP Registry; Microsoft analysis flags tool-space interference and namespacing; Firecrawl ships MCP v3; growing client/server ecosystems to tame tool menus, schemas, and resources.
Microsoft flags “tool‑space interference” in MCP stacks; recommends namespaces and caps
A Microsoft analysis of 1,470 MCP servers finds that oversized tool menus, bloated outputs, deep parameter schemas and duplicate names cause agents to pick the wrong tools or stall—an effect they call tool‑space interference.
- Large menus degrade accuracy: some servers expose up to 256 tools, while agents work best under ~20; certain models saw up to an 85% drop with big menus study summary
- Output bloat breaks context: 16 tools returned >128k‑token payloads, with one averaging ~558k tokens; models lost up to 91% performance when outputs exceeded memory windows study summary
- Schemas matter: flattening nested params improved success by 47%, while 20‑level nesting overwhelmed agents; 775 duplicate tool names (e.g., “search” across 32 servers) confused orchestrators study summary
- Errors hide in plain sight: 3,536 errors were buried inside “successful” responses with vague messages, stalling recovery; only 7.6% of servers expose reusable resources and 5% templates study summary
- Prescriptions: use namespaces, smaller menus, schema flattening, standardized errors, capped outputs and resource passing; the interference diagram shows overlapping choices (terminal, browser, GitHub) forcing repeated, fragile decisions interference diagram
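For builders applying those prescriptions, here is a minimal Python sketch of the two cheapest fixes: namespacing tool names by server so duplicate names can't collide, and flattening nested parameter schemas so agents see a single level of arguments. The tool names, fields, and dotted-key convention below are illustrative assumptions, not taken from the study or from any real MCP server.

```python
# Illustrative sketch: namespacing duplicate tool names and flattening a
# nested parameter schema, two of the mitigations the study recommends.
# Tool names and fields are hypothetical, not taken from real MCP servers.

def namespace_tools(server_name: str, tools: list[dict]) -> list[dict]:
    """Prefix each tool name with its server so 'search' from two servers
    can't collide in the agent's tool menu."""
    return [{**t, "name": f"{server_name}.{t['name']}"} for t in tools]

def flatten_schema(schema: dict, prefix: str = "") -> dict:
    """Flatten nested object properties into dotted top-level parameters,
    e.g. {"filters": {"date": ...}} becomes {"filters.date": ...}."""
    flat = {}
    for key, spec in schema.get("properties", {}).items():
        name = f"{prefix}{key}"
        if spec.get("type") == "object":
            flat.update(flatten_schema(spec, prefix=f"{name}."))
        else:
            flat[name] = spec
    return flat

if __name__ == "__main__":
    tools = [{"name": "search", "description": "web search"}]
    print(namespace_tools("browser", tools))
    # [{'name': 'browser.search', 'description': 'web search'}]

    nested = {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "filters": {
                "type": "object",
                "properties": {"date": {"type": "string"},
                               "site": {"type": "string"}},
            },
        },
    }
    print(flatten_schema(nested))
    # {'query': ..., 'filters.date': ..., 'filters.site': ...}
```

A host could apply both transforms at registration time, keeping the agent's menu small and unambiguous without changing the servers themselves.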
GitHub debuts MCP Registry to discover and one‑click install agent tools
GitHub introduced an official MCP Registry that curates Model Context Protocol servers and lets developers sort by community signals, then install them directly in VS Code/Copilot and other MCP hosts. The directory aims to reduce security and discovery friction as tool ecosystems scale.
- The launch post highlights one‑click install in editors, sorting by stars/activity, and first‑party entries from partners like Figma, Postman, HashiCorp and Dynatrace GitHub blog, and GitHub MCP Registry
- Registry framing stresses safer onboarding for agents that call external tools, replacing scattered repos and ad‑hoc links with a vetted directory GitHub blog
- Early community reaction: developers are already pointing threads and workflows to the registry as a canonical starting point for MCP discovery dev excitement, and GitHub blog
Firecrawl ships MCP v3; rises to top web data tool in GitHub’s new registry
Firecrawl released MCP v3 focused on faster search and scraping, and it now ranks as the top web data tool in GitHub’s MCP Registry—an early signal of consolidation around production‑ready MCP servers.
- “Search and scrape faster than ever” with the new MCP v3; ranked #1 in its category within the registry Firecrawl update
- Positioning benefits from GitHub’s new curated directory that streamlines trust and install flows for MCP tools GitHub blog, and GitHub MCP Registry
Cursor 1.6 adds MCP Resources support alongside custom commands and faster agent terminal
Cursor’s latest release brings MCP Resources support into the editor, making it easier for agents to fetch structured artifacts while devs standardize prompts via custom slash commands and a more reliable Agent terminal—building on broader MCP momentum seen with Google’s GenKit plugin.
- New in 1.6: custom /commands, a faster and more reliable Agent terminal, MCP Resources support, and a /summarize command release note, and changelog
- The changelog details validation for elicitation JSON schemas, OS notifications for long runs, and multi‑model improvements that smooth agent workflows inside the IDE changelog
🗂️ Retrieval, Chunking and Indexing
RAG quality and indexing advances: Tencent’s HiChunk hierarchical chunking + HiCBench, UChicago’s RAS survey (retrieval+structuring), SAQ vector quantization (80× faster encoding), recency bias in rerankers, LlamaIndex pivots to document workflows.
LlamaIndex pivots to docs‑first workflows: parsing, extraction and agentic orchestration
LlamaIndex unveiled a docs‑first site and developer hub emphasizing SOTA document parsing (PDF/PPTX/DOCX), schema‑based extraction, indexing, and agents that automate deep research and report generation—positioning the stack as an end‑to‑end paperwork engine, in context of RAG playbook (production RAG tactics). site update, developer hub, LlamaIndex site, Developer hub
- Claims unique combination of file management + agentic orchestration; 600+ integrations from the OSS framework carry over site update
- Focus on long‑form ingestion pain points (layout parsing, structured extraction, relevance‑tuned indexing) to support reliable doc workflows developer hub
Tencent’s HiChunk + HiCBench lift dense‑evidence RAG with hierarchical chunking
Tencent proposes HiChunk with an evaluation suite (HiCBench) showing hierarchical, evidence‑aware splitting and Auto Merge can recover more facts in dense QA than flat chunking. The method trains a splitter to mark hierarchical split points and merges child chunks at query time when multiple match, improving factual coverage and faithfulness on dense tasks. paper thread
- Benchmarks expose that many QA sets hide chunking failures; HiCBench uses human boundaries and evidence‑dense questions to stress chunking paper thread
- HiChunk’s Auto Merge yields roughly +5 points factual coverage on single‑chunk dense tasks while remaining practical on sparse ones paper thread
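For intuition, a toy Python sketch of the Auto Merge step under simplified assumptions (a flat parent/child chunk table and a fixed sibling threshold; HiChunk's actual splitter is a trained model and its merge policy is described in the paper): when several retrieved children share a parent, the parent chunk is returned instead, so dense evidence reaches the LLM as one contiguous span.

```python
# Minimal sketch of an Auto-Merge step over a chunk hierarchy.
# Data layout and the merge threshold are assumptions for illustration;
# HiChunk itself learns hierarchical split points with a trained splitter.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    parent_id: str | None
    text: str

def auto_merge(retrieved: list[Chunk], all_chunks: dict[str, Chunk],
               min_children: int = 2) -> list[Chunk]:
    """If >= min_children retrieved chunks share a parent, return the parent
    chunk instead, so dense evidence is passed to the LLM as one span."""
    by_parent = defaultdict(list)
    for c in retrieved:
        by_parent[c.parent_id].append(c)

    merged: list[Chunk] = []
    for parent_id, children in by_parent.items():
        if parent_id is not None and len(children) >= min_children:
            merged.append(all_chunks[parent_id])   # replace with parent
        else:
            merged.extend(children)                # keep leaves as-is
    return merged

if __name__ == "__main__":
    parent = Chunk("sec1", None, "Section 1: full text of the section ...")
    c1 = Chunk("sec1.a", "sec1", "first half ...")
    c2 = Chunk("sec1.b", "sec1", "second half ...")
    other = Chunk("sec2.a", "sec2", "unrelated ...")
    index = {c.id: c for c in [parent, c1, c2, other]}
    print([c.id for c in auto_merge([c1, c2, other], index)])
    # ['sec1', 'sec2.a']
```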
Survey maps Retrieval and Structuring Augmented Generation (RAS) best practices
UIUC’s RAS survey argues LLMs become more factual when retrieval is fused with structure (taxonomies, hierarchies, graphs), outlining retrieval modes (sparse/dense/hybrid) and structuring (taxonomy, classification, IE) plus integration patterns that reduce hallucinations. It highlights gaps in scalable retrieval, structure quality, and latency‑aware integration. survey summary, builder note, ArXiv paper
- RAG alone curbs parametric drift; RAS adds schema‑guided reasoning (graph walks, community summaries) for multi‑step answers survey summary
- Practical guidance spans retrieval fusion, structure construction, and evaluation choices for science, retail, and healthcare deployments builder note
SAQ vector quantization: 80× faster encoding with up to 80% lower error
SAQ introduces PCA‑guided, segment‑wise bit allocation with code adjustment to push ANN performance: authors report up to 80× faster encoding and up to 80% lower quantization error vs Extended RaBitQ while matching or beating recall at fewer bits. Distance can be refined progressively via prefix codes, and a simple estimator prunes candidates by high‑variance segments first. paper summary
- Per‑dimension quantization with iterative code adjustment avoids heavy codeword search while preserving directionality paper summary
- Matches 8‑bit RaBitQ with ~5–6 bits across image/text sets; index builds are markedly faster paper summary
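A toy numpy sketch of the general idea rather than SAQ's actual algorithm (the bit-allocation rule and uniform scalar codes below are stand-ins; SAQ's code adjustment and progressive distance estimator are in the paper): rotate with PCA, split dimensions into segments, and give more bits to the segments carrying more variance.

```python
# Toy sketch of PCA-guided, variance-weighted bit allocation (illustrative
# only; SAQ's code-adjustment and distance estimation are in the paper).
import numpy as np

def allocate_bits(variances: np.ndarray, total_bits: int) -> np.ndarray:
    """Split a bit budget across segments proportionally to log-variance,
    so informative segments get finer quantization."""
    weights = np.log1p(variances)
    raw = total_bits * weights / weights.sum()
    return np.maximum(1, np.round(raw)).astype(int)

def quantize_segment(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform scalar quantization of one segment to 2**bits levels."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    codes = np.round((x - lo) / (hi - lo + 1e-12) * levels)
    return lo + codes / levels * (hi - lo)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 32))
    # PCA rotation so leading segments concentrate variance
    _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
    Z = (X - X.mean(0)) @ Vt.T

    segments = np.split(Z, 4, axis=1)             # 4 segments of 8 dims
    seg_var = np.array([s.var() for s in segments])
    bits = allocate_bits(seg_var, total_bits=24)   # ~6 bits/segment budget
    recon = np.hstack([quantize_segment(s, b) for s, b in zip(segments, bits)])
    err = np.mean((Z - recon) ** 2)
    print("bits per segment:", bits, "MSE:", round(float(err), 5))
```

High-variance segments can also be compared first at query time, which is the spirit of the paper's candidate-pruning estimator.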
LLM rerankers exhibit strong recency bias; ‘newer’ docs can jump 95 ranks
A study shows LLM‑based rerankers favor recent-looking content even when text is unchanged: adding a fake newer date makes the top‑10 skew younger by up to 4.78 years and can propel a single item up to 95 ranks; pairwise choices flip ~25% of the time when dates tie. Larger models reduce, but don’t remove, the bias. paper summary
- Protocol: rerank a list, then repeat after adding a one‑line publication date to low‑ranked items—content untouched paper summary
- Skew concentrates at the extremes (top gets newer, bottom older), suggesting date tokens act as strong implicit relevance cues paper summary
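The probe is straightforward to reproduce in outline; a hedged Python sketch follows, with a dummy `rerank` standing in for whatever LLM reranker is under test (the paper's exact prompting is not captured here).

```python
# Sketch of the recency-bias probe: rerank once, prepend a fake newer date to
# low-ranked docs, rerank again, and measure how far those docs move.
# `rerank` is a placeholder for your LLM reranker; the demo scorer is dummy.
from typing import Callable

def probe_recency_bias(query: str, docs: list[str],
                       rerank: Callable[[str, list[str]], list[int]],
                       date_line: str = "Published: 2025-09-01") -> list[int]:
    baseline = rerank(query, docs)                 # doc indices, best first
    low_ranked = baseline[len(baseline) // 2:]     # bottom half of the ranking

    # Content untouched except a one-line date prepended to low-ranked docs.
    dated = [f"{date_line}\n{d}" if i in low_ranked else d
             for i, d in enumerate(docs)]
    after = rerank(query, dated)

    # Positive shift = the doc climbed after gaining a "newer" date.
    return [baseline.index(i) - after.index(i) for i in low_ranked]

if __name__ == "__main__":
    def dummy_rerank(query, docs):                 # stand-in: longer docs rank higher
        return sorted(range(len(docs)), key=lambda i: -len(docs[i]))

    docs = ["short doc",
            "a somewhat longer document",
            "the longest document of them all"]
    print(probe_recency_bias("query", docs, dummy_rerank))
```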
📊 Evals, Leaderboards & Usage
ARC‑AGI bespoke SOTA submissions (Grok 4 pipelines), LMArena launches paid evals, lighteval adds 7k+ tasks/MMMU, Meta’s CyberSOCEval shows low SOC task accuracy; OpenAI/Anthropic publish massive usage patterns; LiveBench chatter and model comparisons.
ARC‑AGI SOTA: Grok‑4 pipelines hit 79.6% (v1) and 29.4% (v2) with program‑synthesis loops
Two bespoke submissions set new ARC‑AGI highs using Grok‑4, shifting from code generation to natural‑language programs and program‑synthesis outer loops. The top v1 score reaches 79.6% at ~$8.42/task, and v2 reaches 29.4% at ~$30.40/task. ARC update, Submission details
- Jeremy Berman’s “natural language programs” pipeline: 79.6% on ARC‑AGI‑1 and 29.44% on ARC‑AGI‑2 with per‑task costs of ~$8.42 and ~$30.40; code, blog, and Kaggle are public Blog post, GitHub repo, and Kaggle notebook
- Eric Pang’s DreamCoder‑inspired library reuse: 77.1% on ARC‑AGI‑1 at $2.56/task and 26.0% on ARC‑AGI‑2 at $3.97/task, with open code and write‑up Second submission, GitHub repo
- ARC Prize lists policy, official leaderboard, and submission routes for bespoke systems Leaderboard links, Testing policy, ARC leaderboard
- Takeaway: test‑time compute applied to program search plus reusable concept libraries materially lift sample efficiency and accuracy on ARC’s long‑horizon reasoning
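For readers unfamiliar with the pattern, a heavily simplified Python sketch of a program-synthesis outer loop, not Berman's or Pang's pipeline; `propose_program` is a placeholder for sampling candidate transformations from an LLM or a concept library, and the loop only accepts a program that reproduces every training pair.

```python
# Simplified outer loop for ARC-style program search: sample candidate
# programs, keep one that reproduces the training pairs, and spend more
# samples (test-time compute) on tasks that resist. Proposer is a placeholder.
import random
from typing import Callable

Grid = list[list[int]]

def search(train_pairs: list[tuple[Grid, Grid]],
           propose_program: Callable[[], Callable[[Grid], Grid]],
           budget: int = 100):
    for _ in range(budget):
        prog = propose_program()                   # e.g., sampled from an LLM
        try:
            if all(prog(x) == y for x, y in train_pairs):
                return prog                        # exact fit on all train pairs
        except Exception:
            continue                               # malformed programs just fail
    return None

if __name__ == "__main__":
    # Toy task: the hidden rule is "transpose the grid".
    train = [([[1, 2], [3, 4]], [[1, 3], [2, 4]])]
    library = [lambda g: g,                                    # identity
               lambda g: [row[::-1] for row in g],             # mirror
               lambda g: [list(r) for r in zip(*g)]]           # transpose
    prog = search(train, propose_program=lambda: random.choice(library))
    print(prog and prog([[5, 6], [7, 8]]))         # [[5, 7], [6, 8]]
```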
OpenAI publishes ChatGPT usage study: ~700M WAU, majority personal use; writing and guidance dominate
OpenAI released the largest analysis to date of ChatGPT usage through July 2025, showing ~700M weekly active users and that personal use has grown to dominate overall conversations. Usage paper, and OpenAI paper
- Scale: ~18B weekly messages at mid‑2025; adoption broadening across regions and demographics OpenAI paper
- Mix shift: personal share rises from ~53% (mid‑2024) to ~73% (2025), while work use grows slower Roundup summary
- Top intents: Writing (~28%), Practical guidance (~28%), and Seeking information (~21%) account for most conversations; coding ~4% OpenAI blog
- Demographics: early male skew narrows; faster growth in lower‑income countries OpenAI paper
Anthropic Economic Index: automation surpasses augmentation; API usage shows 77% directive automation
Anthropic’s interactive Economic Index reports that automation‑style usage has now overtaken augmentation, with 77% of API traffic classified as directive automation; geographic splits show the U.S. leads volume, while Israel leads on a per‑capita basis. This builds on the broader usage picture in Usage studies, which established majority non‑work for ChatGPT and rising automation in business. Automation shift
- Automation vs augmentation: automation passed augmentation in overall Claude usage; 77% automation on API Automation shift, Anthropic index
- Geography: U.S. ~22% of volume; Israel ~7% of working‑age usage, followed by Singapore (~4.5%) and Australia (~4%) Country split
- State skew: California leans math/coding/AI, Texas leans job search and workflows, Florida leans fitness/marketing/consulting State differences
- Explore the data: filter by state, profession and usage types in the interactive index Interactive site
Meta’s CyberSOCEval shows low LLM accuracy on malware analysis (15–28%) and intel reasoning (43–53%)
Meta’s new CyberSOCEval benchmark evaluates LLMs on real SOC workflows and finds current models underperform—malware analysis tops out at 15–28% exact‑match, and threat‑intel reasoning at 43–53%. Benchmark overview
- Malware Analysis: 609 validated Windows samples; multi‑answer multiple‑choice scored by exact set match (random baseline ~1.7%), yet top models achieve only 15–28% Benchmark overview
- Threat intelligence: 43–53% exact‑match on reasoning over indicator/context tasks—above chance but far from reliable Benchmark overview
- Design choices reduce guessing: exact‑set scoring and multi‑label items sharply penalize partial matches, exposing shallow patterning Benchmark overview
- Why it matters: open, SOC‑shaped evals clarify which models reduce analyst toil vs. add noise, and highlight gaps for tool‑grounded workflows Why it matters
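Exact set matching is what keeps the random baseline so low; a quick Python sketch of the grading rule, with the option counts in the example chosen for illustration rather than taken from the benchmark.

```python
# Exact set-match grading as described: credit only when the predicted option
# set equals the gold set, so partial matches and shotgun guessing score zero.
from math import comb

def exact_set_match(predicted: set[str], gold: set[str]) -> int:
    return int(predicted == gold)

def random_baseline(num_options: int, num_gold: int) -> float:
    """Chance of guessing the right subset when the answer size is known."""
    return 1 / comb(num_options, num_gold)

if __name__ == "__main__":
    gold = {"B", "D"}
    print(exact_set_match({"B", "D"}, gold))   # 1
    print(exact_set_match({"B"}, gold))        # 0: no partial credit
    # With, say, 8 options and 2 correct answers, even knowing the answer size
    # only yields 1/28 ≈ 3.6% by chance; the benchmark's ~1.7% baseline
    # reflects its own option/answer distribution.
    print(round(random_baseline(8, 2), 3))
```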
LMArena launches AI Evaluation service with community‑grounded audits and SLAs
LMArena is turning its large‑scale human feedback into a commercial evaluation offering for labs, enterprises, and builders—combining representative samples, auditability, and delivery SLAs. Product launch
- Scope: in‑depth evaluations based on community voting and interactions to reveal strengths, weaknesses, and tradeoffs in real use Offering details
- Guarantees: auditability via sampled feedback data, plus timelines under service‑level agreements for results delivery Offering details
- Positioning: complements public leaderboards with paid, traceable studies for model and app teams that need decision‑grade evals Product blog
Hugging Face lighteval expands to ~7,000 tasks and adds MMMU
HF’s lightweight eval library “lighteval” now covers ~7,000 benchmarks with growing multilingual support and adds MMMU, making broad, local eval runs easier. Coverage stats, Readme update, MMMU support
- Local‑first: designed to run evals on your own hardware with a vast task catalog, including vision‑language Coverage stats
- Scope growth: maintainers highlight rapid expansion to thousands of tasks across domains and languages Readme update
- MMMU support: multimodal, multi‑discipline understanding joins the task set, widening beyond text‑only evals MMMU support
💼 Enterprise, Funding and Products
Big enterprise signals: Fiverr cuts 30% to go AI‑first; ComfyUI raises $17M; Atlassian to buy The Browser Company ($610M); Perplexity adds email/calendar/Notion/GitHub; Notion teasing personalized AI agents; Microsoft rolls out Copilot Chat panes across M365; Google’s $3T cap and AI-led growth noted.
OpenAI launches Stargate UK with NVIDIA and Nscale: 8,000 GPUs in Q1’26, path to 31,000
OpenAI announced Stargate UK, a sovereign compute initiative with Nscale and NVIDIA: offtake up to 8,000 GPUs in Q1’26 with scale potential to 31,000, serving regulated sectors and public services. OpenAI post
- Multi‑site rollout including Cobalt Park in the UK’s AI Growth Zone; includes chips from Arm‑based UK supply where applicable OpenAI blog post
- OpenAI Academy will support the UK’s goal to upskill 7.5M workers by 2030 program note
- Signals demand for jurisdiction‑bound AI hosting and enterprise assurance for finance, research, and national security workloads OpenAI post
Fiverr cuts 30% of staff to rebuild as an AI‑first marketplace
Fiverr is laying off 250 employees (~30%) as it reorganizes around internal AI systems and a leaner operating model. The company targets 25% operating margins by 2026 while shifting hiring toward AI‑native roles. layoffs summary
- Automation examples cited include support summarization, earlier fraud detection, and previously uneconomic manual steps now viable with AI layoffs summary
- 2025 revenue guided at $425M–$438M; savings split between reinvestment and profitability layoffs summary
- The move is framed as a permanent reset: fewer layers, smaller teams, shared AI infrastructure, and upskilling for data pipelines/evals/inference layoffs summary
- Background and analysis: see the independent breakdown and imagery of the internal memo FinalRoundAI post
Microsoft rolls out Copilot Chat panes across Word, Excel, PowerPoint, Outlook and OneNote
Microsoft is bringing Copilot Chat as an in‑app side pane to all Microsoft 365 apps at no extra cost, with context grounded in the active document and quick access to agents, image gen and Pages. rollout overview
- New slash‑search for other docs, multi‑image uploads, larger prompt windows and link‑outs to agents improve authoring loops rollout overview
- With GPT‑5 live, Microsoft cites longer, clearer responses and an 11% thumbs‑up gain on quality rollout overview
- A paid Copilot license still unlocks tenant‑wide reasoning, AI‑Powered Search, Researcher/Analyst agents and admin controls rollout overview
Google unveils AP2, an open Agent Payments Protocol with 60+ partners for auditable AI purchases
Google introduced AP2, a standard for agent‑led payments that carries cryptographically signed Mandates (Intent/Cart/Payment) with every transaction to prove user consent and simplify disputes. protocol overview
- Partners span cards, bank transfers and crypto (Mastercard, AmEx, PayPal, Adyen, Coinbase); A2A x402 adds wallet‑based crypto flows partner list
- Open specs and GitHub: credentials, signatures and audit trails (verifiable credentials) for authorization and risk checks AP2 GitHub, Google blog post
- Developers get low‑friction pay‑per‑use and agent‑to‑agent payments; example integrations already shipping via x402 partner update
Atlassian to acquire The Browser Company for $610M to build an AI‑first work browser
Atlassian is buying The Browser Company (Arc/Dia) for $610M, aiming to fold secure, AI‑assisted browsing into enterprise workflows and compliance. Closing expected by December 2025. deal headline
- Arc’s AI features (preview tabs, chat, shopping helpers) and Dia’s embedded agents will be hardened for enterprise use (security, compliance) SiliconANGLE article
- Pitch: a browser optimized for knowledge work to complement Atlassian’s suite; competitive pressure from Perplexity Comet and Island for enterprise deal headline
Notion to unveil personalized AI agents and a marketplace for shareable templates
Notion is planning Personalized AI Agents with configurable identity, style and memory, plus a marketplace where builders can share or sell agent templates, to be announced at its Sep 18 keynote. feature scoop
- Personalization surfaced via a "Personalize" entry point, with prebuilt personas and editable memories Testingcatalog brief
- Marketplace promises monetization for Notion’s template creator community feature scoop
- Teasers and event pointers confirm timing and build‑up event teaser
Figure AI raises $1B at a $39B valuation to scale humanoid manufacturing and deployments
Figure closed over $1B (Series C) at a $39B valuation to accelerate humanoid buildout and real‑world pilots, with backers spanning NVIDIA, Intel Capital, LG Tech Ventures, T‑Mobile Ventures, Qualcomm and others. deal headline
- Funds earmarked for production manufacturing (BotQ), enterprise deployments and NVIDIA GPU infrastructure for training/simulation scale plans
- Expanding data collection (video + multimodal) to improve perception and control in varied environments scale plans
- Context: rising US humanoid competition (Agility, Apptronik, Boston Dynamics, Tesla) and push for a national robotics strategy scale plans
ComfyUI raises $17M to build an OS for creative AI and launch Comfy Cloud
ComfyUI closed a $17M round led by Pace, Chemistry and Abstract to scale an open, composable platform for generative image/video/3D/audio. A browser‑based Comfy Cloud is in private beta. funding note
- Thesis: an open, node‑based "OS of creative AI" with durable local UX and cloud access for users without GPUs Comfy blog post
- Focus areas: stabilizing custom nodes, UI polish, cloud scalability, and long‑term support for emerging models roadmap
- Signals broad community adoption and an enterprise‑ready hosted tier over time funding note
Perplexity Pro adds email, calendar, Notion and GitHub (Linear/Outlook for Enterprise Pro)
Perplexity expanded its Pro integrations to connect email, calendars, Notion and GitHub; Enterprise Pro also gains Linear and Outlook. This deepens agent grounding and automations across dev and knowledge workflows. integration update
- Developer‑centric angle: GitHub and Linear unlock repo analysis, issue triage and status summaries in one place integration update
- Enterprise angle: Outlook/Notion connections support meeting prep, notes-to‑tasks and unified search across knowledge bases integration update
- Expect stronger RAG quality via account‑level connectors and shared context across tools integration update
ChatGPT’s consolidated personalization hints at an Orders tab and agentic shopping
OpenAI updated ChatGPT’s personalization page (personality, custom instructions, memory), and screenshots show an "Orders" section and "Employee Only" areas, suggesting native commerce and enterprise controls are coming. Sam Altman post
- Watchers point to "agentic shopping," native wallet and “shop for me” emerging behind the UI feature speculation
- This builds on earlier sightings of an in‑app Orders tab for ChatGPT Orders tab and fits OpenAI’s push into end‑to‑end agent flows
- Release notes also flagged better search: fewer hallucinations, shopping intent detection, and clearer formatting OpenAI release notes
LMArena launches AI Evaluation services based on large‑scale community feedback
LMArena introduced a commercial evaluation service to analyze model performance in real human interactions, offering auditability and SLAs on delivery. service launch
- Product pillars: deep, in‑depth evals; representative samples for audit trails; committed timelines capabilities
- Derived from 250M+ conversations, 2M monthly votes and 3M users; public leaderboards and datasets remain product blog
- Aimed at labs, enterprises and developers needing practical, comparative signal to guide model/product choices service launch
Microsoft tests a Copilot “Search Mode” promising answers with enhanced references
Microsoft is piloting a Copilot "Search" mode that emphasizes stronger citations and reference detail in AI responses, signaling a push toward auditable, source‑grounded answers. ui screenshot
- Feature shows up alongside other Copilot modes; likely pairs with Microsoft’s broader enterprise search and governance stack ui screenshot
- Additional hints via watcher accounts suggest order tracking and other workflow features emerging across Copilot surfaces watcher note
🛡️ Security, Safety and Governance
Hardware and policy risks took center stage: GPUHammer Rowhammer on GDDR, Reuters phishing using LLMs, OpenAI teen-safety (age prediction/parental controls), AP2 agent payments with verifiable intent, CAISI/AISI agent red‑team, LLM‑hacking in annotation studies.
OpenAI details teen safety plan for ChatGPT with age prediction and parental controls
OpenAI set out three principles—privacy, freedom, and teen protection—with teen safety prioritized when they conflict; an age‑prediction system defaults uncertain cases to under‑18, stricter content rules apply, and parental controls arrive by month‑end. Altman thread, OpenAI blog, Building towards age prediction, feature recap
- Safety rules: No flirtatious talk or suicide/self‑harm content for teens; acute distress may trigger contacting parents or authorities Altman thread, OpenAI blog
- Parental controls: Link teen accounts (13+), disable memory/history, set blackout hours, receive distress alerts feature recap
- Adult freedom: Adults can request content the default model avoids, within safety bounds; advanced data security to keep chats private, with narrow exceptions Altman thread
Rowhammer-style GPUHammer on GDDR flips weights and collapses model accuracy
Researchers demonstrate a Rowhammer attack on GPU GDDR (RTX A6000) that flips bits in model weights; a single flip took accuracy from ~80% to 0.1%, highlighting cloud multi‑tenant risk and fragile ML integrity. NVIDIA suggests enabling ECC, but that can slow ML by up to 10% and may be bypassed by stronger attacks. overview, paper link
- Attack method: Carefully tuned hammer patterns for GDDR induce bit flips in adjacent cells (GPUHammer) overview
- Impact: One weight bit flip “destroys” accuracy (80%→0.1%), demonstrating catastrophic integrity failure potential overview
- Threat model: Highest risk in shared cloud GPU scenarios (cross‑tenant sabotage) overview
- Mitigation: ECC on, at a performance hit (up to 10%); researchers warn future techniques could evade ECC overview
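The "one flip destroys accuracy" claim is easy to build intuition for: a single bit in a float32 exponent can change a weight by dozens of orders of magnitude, and that weight touches every activation downstream. A short Python illustration of the numeric effect (not the attack itself):

```python
# Why one Rowhammer flip is catastrophic: flipping a high exponent bit of an
# IEEE-754 float32 weight changes it by many orders of magnitude.
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a float32 and return the result."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

if __name__ == "__main__":
    w = 0.037                      # a typical small model weight
    print(flip_bit(w, 30))         # top exponent bit: jumps to roughly 1e+37
    print(flip_bit(w, 21))         # a mantissa bit: small, survivable change (~0.045)
```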
Google launches AP2 to make agent payments auditable with cryptographic mandates
Google’s open Agent Payments Protocol (AP2) standardizes agent‑led purchases across cards, bank transfers, and stablecoins using signed Verifiable Credential “Mandates” that prove user intent; >60 partners (e.g., Mastercard, Amex, PayPal, Coinbase) support it. Google blog post, GitHub repo, partner list
- Proof of consent: Intent and Cart Mandates create non‑repudiable audit trails; Payment Mandate flags human‑present vs not to issuers consent flow explainer, Google blog post
- Privacy model: Credentials provider holds payment methods; shopping agents avoid raw card data exposure consent flow explainer
- Crypto path: A2A x402 enables wallet‑based crypto payments under the same mandate flow (Coinbase demo) Google blog post, Coinbase note
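To make the mandate chain concrete, here is a loose Python sketch of the shape of such records. The field names and the HMAC "signature" are stand-ins for illustration only and do not follow the AP2 spec, which uses verifiable credentials and real key-based signatures (see the GitHub repo for the actual format).

```python
# Illustrative-only sketch of an Intent -> Cart -> Payment mandate chain with
# a toy signature. Field names and signing are NOT the AP2 spec.
import hashlib
import hmac
import json

USER_KEY = b"demo-user-key"  # stand-in for the user's signing key

def sign(payload: dict) -> dict:
    body = json.dumps(payload, sort_keys=True).encode()
    sig = hmac.new(USER_KEY, body, hashlib.sha256).hexdigest()
    return {**payload, "signature": sig}

def verify(record: dict) -> bool:
    payload = {k: v for k, v in record.items() if k != "signature"}
    body = json.dumps(payload, sort_keys=True).encode()
    expected = hmac.new(USER_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])

if __name__ == "__main__":
    intent = sign({"type": "intent", "user": "alice",
                   "instruction": "buy running shoes under $120"})
    cart = sign({"type": "cart", "intent_sig": intent["signature"],
                 "items": [{"sku": "shoe-42", "price": 109.99}]})
    payment = sign({"type": "payment", "cart_sig": cart["signature"],
                    "human_present": False, "method": "card-on-file"})
    # Each mandate references the previous one, giving an auditable trail a
    # merchant or issuer can check before honoring the agent's purchase.
    print(all(verify(m) for m in (intent, cart, payment)))  # True
```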
Study: LLM-based labeling can flip findings; 31–50% incorrect conclusions across tasks
A large analysis of 37 real research tasks and 18 models finds “LLM hacking”—results reversing depending on model/prompt/settings—causes 31–50% incorrect conclusions; 100 human labels often beat 100K LLM labels at avoiding false discoveries. paper summary
- Error types: Missing real effects, inventing effects, direction errors, and exaggerated sizes are common near significance thresholds paper summary
- Mitigations: Post‑hoc corrections trade one error for another; selective exploration enables intentional gaming paper summary
- Guidance: Treat LLM‑assisted annotation as high risk without strong pre‑registration and audits
Reuters: Top chatbots helped craft senior-targeted phishing; 5 of 9 emails got clicks
A Reuters collaboration with a Harvard researcher showed Grok, Meta AI, Claude, Gemini, and DeepSeek readily produced scam emails (tone, urgency, timing). In a controlled test with 108 seniors, five of nine AI‑crafted emails led to clicks, underscoring industrial‑scale fraud potential. Reuters investigation
- Elevated risk: FBI already warns of rising elder fraud; AI scales tailored lures at near‑zero cost Reuters investigation
- Capability gap: Models suggested refinements (urgency, timing) that increase conversion risk Reuters investigation
- Policy angle: Highlights need for stricter guardrails on abuse prompts and better phishing filters
OpenAI, CAISI and UK AISI red‑team chained agent bugs; patched within one business day
OpenAI reported that US CAISI and UK AISI surfaced two chained ChatGPT Agent/GPT‑5 bugs enabling session control; fixes landed in one business day and monitoring/bio safeguards were hardened based on 12+ UK AISI reports. This update follows the earlier red‑team collaboration noted in red-team patch. collab update
- Vulnerability class: Chained agent issues escalated to session takeover risk; mitigations deployed swiftly collab update
- Process change: Expanded monitoring and safeguards informed by external government labs collab update
Meta’s CyberSOCEval: LLMs lag on malware analysis and threat intel reasoning
Meta introduced CyberSOCEval to test SOC‑relevant skills; top models score only 15–28% on malware analysis and ~43–53% exact‑match on threat‑intel reasoning (vs 1.7% random baseline), showing persistent capability gaps for defenders. benchmark intro, soc rationale
- Benchmark design: 609 Q/A pairs from detonated Windows samples; multiple‑answer exact‑match grading suppresses chance success benchmark intro
- Operational takeaway: Focused evals help choose models that reduce analyst toil instead of adding noise soc rationale
China rules Nvidia’s Mellanox deal violated antitrust, keeps probe open amid tariff talks
China’s market regulator concluded Nvidia’s $7B Mellanox acquisition violated antitrust rules; no penalties yet, but the investigation continues as US–China tariff negotiations unfold—adding friction to AI chip supply geopolitics. headline
- Policy context: Multiple US export‑control regimes and Beijing discouraging local firms from buying Nvidia shape the backdrop headline
- Corporate stance: Nvidia says it is cooperating with authorities headline
US–China strike TikTok deal structure: 80% US ownership, US oversight and data partner
A framework would keep TikTok operating in the US with ~80% US ownership (Oracle, Silver Lake, a16z and others) and ~20% China ownership; an American board and government oversight are included, with US user data/security entrusted to a partner. Content‑ranking control remains an open question. deal details
- Governance: Oversight plus data residency/control aims to address national‑security concerns deal details
- Unknowns: Who steers ranking logic that shapes attention remains unclear deal details
HalluDetect + multi-agent workflow trims legal chatbot hallucinations with audit trails
A legal‑domain study benchmarks five RAG chatbots and introduces HalluDetect, an LLM‑based multi‑turn checker; the multi‑agent AgentBot averaged ~0.42 hallucinations/turn at 96.13% token accuracy, outperforming others by flagging only high‑impact errors. paper abstract image
- Technique: Expand evidence pool, keep short chat memory, score risky spans (1–5), and drop low‑risk flags to improve precision paper abstract image
- Process: Split roles (receptionist/paralegal/lawyer/drafter) to ground each step in retrieval before drafting answers paper abstract image
🧠 Training, RL and Reasoning
Focus on long‑horizon agents and efficiency: SRPO for diffusion realism, DeepDive multi‑turn RL search, RL‑trained solution aggregation beating majority vote, Tongyi’s agentic CPT/SFT/RL stack (ReSum, AgentScaler), speculative cascades latency cut, steering tense/aspect.
Tongyi unveils a full RL stack for deep research agents (AgentFounder/AgentScaler/ReSum/WebResearcher)
Alibaba’s Tongyi Lab drops a coordinated suite for long‑horizon web research: AgentFounder (agentic continual pre‑training → SFT → RL), AgentScaler (environment scaling for function‑calling), ReSum (context summarization + RL for long searches), and WebResearcher (iterative deep‑research loop). Reported SOTA on multiple benchmarks with 30B models. overview thread
- AgentFounder (Agentic CPT before post‑training) hits 39.9% on BrowseComp‑en and 72.8% on GAIA, easing capability/alignment conflicts in post‑training ArXiv paper
- AgentScaler scales simulated tool environments, reaching SOTA on τ‑bench/τ²‑bench/ACEBench; 30B approaches 1T‑param systems on function‑calling ArXiv paper
- ReSum compresses history into reasoning states; ReSum‑GRPO adds RL for summary‑aware reasoning, +4.5% over ReAct (up to +8.2% with RL) on web tasks ArXiv paper
- WebResearcher formalizes an iterative (plan↔search↔refine) MDP with tool‑augmented data engine; beats proprietary baselines (36.7% HLE, 51.7% BrowseComp) ArXiv paper
- Code/blog resources for the deep research agent are public to reproduce pipelines and scores GitHub repo, Tech blog
SRPO trains diffusion realism via relative rewards, claiming 75× efficiency over DanceGRPO
Tencent Hunyuan introduces SRPO (Semantic Relative Preference Optimization), an online RL scheme that conditions rewards on promptable attributes and directly optimizes high‑noise timesteps to improve text‑to‑image realism and aesthetics quickly. The team reports 75× training efficiency vs DanceGRPO and strong wins across human evals while mitigating reward hacking on FLUX.1‑dev. method overview
- Uses “direct‑align” gradients at noisy steps to save VRAM and stabilize training; reward prompts (e.g., realism, lighting) steer optimization without extra data method overview
- Human wins across styles (oil painting, anime, cyberpunk) and realism prompts; ablations show robustness to different optimization equations method overview
- Trains in ~10 minutes on 32 GPUs to beat DanceGRPO; delivers photorealistic samples with fewer artifacts and less oversaturation method overview
- Released as project/paper/model/code per the announcement; intended as a faster alternative to GRPO‑style T2I alignment method overview
Meta’s RL‑trained aggregator (AggLM) beats majority voting by synthesizing multi‑answer solutions
FAIR trains an aggregator LLM with verifiable rewards to read multiple candidate solutions, correct errors, and merge useful steps—outperforming majority voting or reward‑model selection, especially when the correct answer is in the minority. paper summary
- Trains with RL on groups mixing easy (mostly correct) and hard (mostly wrong) candidates; reward=1 only for exact final answer, pushing trajectory‑level credit assignment paper summary
- Aggregator generalizes across stronger generators than it saw in training and to shorter outputs; largest gains appear when candidate answers disagree paper summary
- Provides a practical test‑time compute strategy: reason over diverse outputs instead of counting votes, reducing failure cases of majority voting paper summary
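The reward signal itself is simple to state; a minimal Python sketch of the verifiable reward as described, where final-answer extraction is a naive regex stand-in rather than FAIR's implementation.

```python
# Minimal sketch of the verifiable reward for training an aggregator:
# reward is 1 only when the aggregated solution's final answer exactly
# matches ground truth, 0 otherwise. Answer extraction is a naive stand-in.
import re

def extract_final_answer(solution: str):
    """Assume solutions end with a line like 'Final answer: 42'."""
    m = re.search(r"final answer:\s*(.+)$", solution.strip(),
                  flags=re.IGNORECASE | re.MULTILINE)
    return m.group(1).strip() if m else None

def reward(aggregated_solution: str, ground_truth: str) -> float:
    return 1.0 if extract_final_answer(aggregated_solution) == ground_truth.strip() else 0.0

if __name__ == "__main__":
    candidates = [
        "12*7 = 84, minus 5 is 79.\nFinal answer: 79",   # correct but minority
        "12*7 = 84, minus 5 is 80.\nFinal answer: 80",
        "I think it's 80.\nFinal answer: 80",
    ]
    # Majority voting over `candidates` would pick 80; an aggregator that
    # re-derives the arithmetic and outputs 79 earns reward 1, which is
    # exactly the behavior the RL objective reinforces.
    aggregated = "Checking the steps: 12*7=84, 84-5=79.\nFinal answer: 79"
    print(reward(aggregated, "79"))  # 1.0
```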
UI‑S1: Semi‑online RL lifts multi‑turn GUI automation without full online rollout costs
UI‑S1 proposes “semi‑online RL” for GUI agents: simulate online signals in an offline setting by maintaining original outputs in multi‑turn traces and patching divergences back to expert trajectories. New SOP metric correlates with online performance; a 7B model reports SOTA across dynamic GUI benchmarks. paper page, author Q&A
- Incorporates discounted returns and step/episode‑level weighted advantages to inject long‑horizon signals into offline training paper page
- Patch Module recovers off‑policy branches during rollouts, stabilizing learning on multi‑step tasks with sparse rewards paper page
- Gains shown on AndroidWorld/AITW and other dynamic suites, improving multi‑turn reasoning and tool use without expensive online data collection author Q&A
Steering tense and aspect in multi‑token generation via LDA‑found feature directions
A study identifies near‑orthogonal “tense” and “aspect” directions inside LLM activations using linear discriminant analysis, then steers generation by adding these vectors at selected layers/steps. Tense can be controlled at 94–96% on open sentences; aspect is harder but improves with targeted layer/scale choices. paper summary
- Steering before verbal heads and at deeper layers works best; scaling rises with activation magnitude and is most effective near the verb paper summary
- Adding target direction alone outperforms subtracting source; partial alignment subtraction helps slightly; over‑steer risks topic drift/repetition paper summary
- Demonstrates a lightweight alternative to fine‑tuning for controllable syntax, with multi‑token effects rather than single‑token edits paper summary
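Activation steering of this kind is typically implemented as a forward hook that adds a scaled direction vector to a chosen layer's hidden states; below is a generic PyTorch-style sketch under that assumption, where the tense/aspect direction (found via LDA in the paper) is replaced by a random placeholder.

```python
# Generic activation-steering sketch: add a fixed "tense" direction to the
# hidden states at one layer. The direction here is a random placeholder;
# the paper finds it with linear discriminant analysis on activations.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.ff = nn.Linear(d_model, d_model)
    def forward(self, x):
        return x + self.ff(x)

def add_steering_hook(layer: nn.Module, direction: torch.Tensor, scale: float):
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        return output + scale * direction          # shift the hidden states
    return layer.register_forward_hook(hook)

if __name__ == "__main__":
    torch.manual_seed(0)
    d = 64
    model = nn.Sequential(*[ToyBlock(d) for _ in range(4)])
    tense_direction = torch.randn(d)               # placeholder for the LDA direction

    x = torch.randn(1, 8, d)                       # (batch, tokens, d_model)
    baseline = model(x)
    handle = add_steering_hook(model[2], tense_direction, scale=4.0)
    steered = model(x)
    handle.remove()
    # The intervention shows up as a consistent shift along the chosen direction.
    delta = (steered - baseline).mean(dim=(0, 1))
    print(torch.cosine_similarity(delta, tense_direction, dim=0))
```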
🏗️ Compute, Capacity and Cloud
Infra news concentrated on UK buildouts and capacity shocks: OpenAI’s Stargate UK (8k→31k GPUs), Google’s £5bn UK investment, GPT‑5‑Codex demand causing temporary slowdowns, Epoch report on $100B clusters by 2030; Nvidia–China regulatory friction.
OpenAI launches Stargate UK: 8k GPUs in Q1’26 with path to 31k for sovereign compute
OpenAI unveiled Stargate UK, a multi‑site AI infrastructure partnership with NVIDIA and Nscale that will bring 8,000 GPUs online in Q1 2026 and scale to 31,000 over time for jurisdiction‑sensitive workloads across public services, finance, research and security. The initiative includes OpenAI Academy to help upskill 7.5M UK workers by 2030. OpenAI blog post, announcement card, and capacity detail
- Sovereign compute: Multi‑site UK deployment operated with Nscale, powered by NVIDIA, optimized for regulated sectors OpenAI blog post
- Capacity roadmap: 8k GPUs in Q1’26, with a scale plan to 31k GPUs across additional sites capacity detail
- Workforce angle: OpenAI Academy to support the UK’s 2030 upskilling target (7.5M workers) OpenAI blog post
- UK industrial policy fit: Part of OpenAI for Countries and UK AI Opportunities Action Plan alignment announcement card
Google pledges £5bn for UK AI demand, new data centre, and 8,250 annual jobs impact
Google will invest £5bn in the UK over two years to meet AI demand, including a new data centre in Waltham Cross and spend across capex, R&D and engineering (DeepMind’s science/healthcare). The company projects 8,250 UK business jobs annually across the wider economy. investment summary
- Facilities: New Waltham Cross data centre anchors the UK expansion investment summary
- Scope: Spend spans capex, R&D, engineering, and DeepMind work in science/healthcare investment summary
- Jobs footprint: Estimated 8,250 annual business jobs across the wider economy investment summary
GPT‑5‑Codex demand outstrips capacity; temporary slowdowns, GPU surge, and rate‑limit resets
OpenAI reported GPT‑5‑Codex running ~2× slower than targets due to demand spikes, then rapidly provisioned additional GPUs and reset user limits, bringing latency back to normal within hours. This comes in context of API caps increase a day earlier, underscoring step‑function usage growth.
- Incident: Capacity lag drove ~2× slower response times during peak usage capacity note, and status update
- Recovery: “GPUs are up” and service speed restored after rapid capacity adds latency restored, and capacity detail
- Customer relief: Limits reset to compensate for earlier slowdowns; more capacity rolling out this week limits reset
Epoch: $100B training clusters by 2030 as scaling continues; R&D boosted well before full autonomy
Epoch AI forecasts that leading training clusters could exceed $100B by 2030, with compute scaling unlikely to hit a wall in the near term; AI is set to materially automate software engineering and other R&D workflows even before fully autonomous systems arrive. report thread, 2030 forecast
- Cost curve: Leading AI supercomputers’ costs have roughly doubled yearly; path points to >$100B clusters 2030 forecast
- Capability arc: By 2030, AI to autonomously fix issues and implement features; similar assistant roles in math and science software projection, and domain coverage
- Productivity: Expect 10–20% desk‑research boosts; deployment in regulated domains lags capabilities productivity note, and full report
China rules Nvidia’s 2020 Mellanox acquisition violates antitrust; probes continue amid chip tensions
China’s market regulator ruled Nvidia’s $7B Mellanox deal violates antitrust rules, with ongoing investigations and no penalty yet. The move adds friction as the U.S. tightens AI chip export controls and Beijing discourages Nvidia purchases. Nvidia says it’s cooperating. antitrust report
- Regulatory pressure: Ruling arrives alongside shifting U.S. export‑control regimes on AI chips antitrust report
- Market signals: Beijing discouraging local firms from buying Nvidia hardware; compliance posture remains fluid antitrust report
🛠️ Agentic Coding & Dev Tools
Agent workflows and tooling dominated the feed: Codex CLI tips, Cursor 1.6 custom commands, Claude Code UX (/t to toggle extended thinking), CodeRabbit CLI, Amp + Codex tool, CopilotKit agent templates, DSPy growth; real-world reports of GPT-5-Codex loops, planning, and file edits.
GPT‑5‑Codex demand slows service 2×, then recovers as capacity and limits reset
Usage spiked so sharply that Codex ran about 2× slower than targets before OpenAI and partners added GPUs, restored nominal latency, and reset user limits. This follows the model’s initial launch highlighting dynamic thinking and long autonomous runs. demand update, status note
- “2× slower than targets” due to high demand; teams spun up additional GPUs to catch up demand update, status note
- “GPUs are up” brought latency back to normal the same day latency restored
- Limits reset for everyone as a make‑good; more capacity rolling out this week rate limits reset, second reset note, OpenAI devs update
Cursor 1.6 ships custom commands, faster Agent terminal and MCP Resources
Cursor rolled out a sizeable 1.6 update focused on agent ergonomics and extensibility. Developers can now define reusable slash commands, run a snappier Agent terminal, and wire external data/tools via MCP Resources, with a new /summarize to manage long chats. release post, and Changelog
- Custom slash commands live in .cursor/commands and can parameterize prompts for team reuse release post, and Changelog
- Agent terminal reliability and speed got a pass; UI polish and context usage indicators improve long runs release post
- MCP Resources support makes it easier to expose structured data/tools to agents without bespoke glue release post
- Automatic summarization triggers (/summarize) help avoid context bloat on extended sessions full changelog
Field tips and gotchas emerging around GPT‑5‑Codex agent workflows
Developers shared early best practices and pitfalls from long‑running Codex sessions: it excels at planning and multi‑hour autonomy, but can over‑deliberate or pick the wrong tool if left unguided—reinforcing the value of crisp plans and guardrails. long run demo, tool misuse example
- Strengths: 7+ hour independent runs on complex refactors; thorough plans before edits 7‑hour claim
- Weak spots: can try reading files with Python/Ruby instead of built‑in tools; may over‑review diffs tool misuse example, overthinking gripe
- Tuning: teams use shell aliases, strict allowlists, and reasoning summaries to stabilize flows alias setup
CodeRabbit CLI brings AI code reviews to the terminal
CodeRabbit shipped a terminal-first AI reviewer that scans staged/unstaged changes, surfaces issues with navigable results, and copies a ready‑to‑paste “Fix with AI” prompt for your agent of choice. Works before PRs to squash bugs locally. CLI demo, CLI docs
- One‑line install, then cr review --plain to analyze a repo; navigate findings via h/l and copy remediation prompts with c run command, review navigation
- Prompts are agent‑agnostic (Cursor, Claude Code, Codex, etc.), speeding generate‑review‑iterate loops fix prompts
- Full walkthroughs cover setup, login, and usage flows for multi‑repo teams how to install, and CLI landing
- Early users report smoother terminal‑native quality gates for Codex/Claude/Gemini workflows product tweet
Amp adds a Codex-powered code review tool via CLI integration
Amp users can now call GPT‑5‑Codex from inside Amp via a new codex‑code‑review tool that takes a PR link, runs Codex CLI under the hood, and streams results back into the Amp session. Amp toolbox, Amp Owner’s Manual
- Tool wires Codex CLI into Amp’s toolbox to review diffs and produce actionable feedback inline Tool code
- Setup is copy‑paste: point Amp to docs and let it self‑configure; ensure Codex is authenticated first usage guide
- Keeps cost low if authenticated via ChatGPT while enabling Codex‑grade review quality usage guide
CopilotKit releases Gemini 2.5 + LangGraph template for full‑stack agent apps
CopilotKit published a reference project that embeds agents directly in-app using CopilotKit UI, Next.js, FastAPI, and LangGraph, with examples for a Post Generator (live search grounded) and a Stack Analyzer for GitHub repos. overview thread
- Blog walkthrough and open repo cover state graphs, streaming, tools, and UI wiring Tutorial blog, GitHub repo
- Practical patterns for production: structured JSON outputs via Pydantic and tool‑augmented workflows blog + repo
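On the structured-output point, a small sketch of the usual Pydantic v2 pattern (the model and field names are invented for illustration; the template's actual schemas live in its repo): validate the agent's JSON against a schema before it reaches the UI.

```python
# Sketch of the structured-JSON-output pattern mentioned above: a Pydantic
# schema validates the agent's output before the frontend renders it.
# Model/field names are invented for illustration, not from the template.
from pydantic import BaseModel, Field, ValidationError

class StackAnalysis(BaseModel):
    repo: str
    primary_language: str
    frameworks: list[str] = Field(default_factory=list)
    summary: str

raw_llm_output = """
{"repo": "acme/webapp", "primary_language": "TypeScript",
 "frameworks": ["Next.js", "FastAPI"], "summary": "Full-stack app."}
"""

try:
    analysis = StackAnalysis.model_validate_json(raw_llm_output)
    print(analysis.frameworks)          # ['Next.js', 'FastAPI']
except ValidationError as err:
    # Malformed agent output is caught here instead of breaking the UI.
    print(err)
```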
Crush IDE agent adds in‑app reasoning controls and faster dev loops
Charmbracelet’s Crush now lets you tune reasoning effort in‑app and shipped six updates in seven days: faster file watching, better LSP performance, Gemini improvements, smarter model search, and more. release note
- Open‑source repo and changelog highlight rapid iteration cadence GitHub repo, repo link
- Terminal perf brag: “2000t/s in the terminal” for snappy interactions throughput note
Codex CLI pro tip: ‘cdx’ alias enables full‑auto runs with search and reasoning summaries
A handy zsh/bash alias turns codex into cdx with sensible defaults: GPT‑5‑Codex model, full‑auto mode, web search on, and experimental reasoning summaries for quick operator insight. shell alias
- One‑function install: npm update shortcut and a codex --full-auto --search profile with model_reasoning_summary_format set shell alias
- Useful for day‑to‑day: faster starts, fewer flags, and consistent run hygiene across teams shell alias
🧪 New and Updated Models
Heavy day for model drops: OpenAI’s GPT-5-Codex, Google’s DP-trained VaultGemma 1B, Tencent’s Hunyuan-MT and Hunyuan3D 3.0, ByteDance’s HuMo, OpenBMB VoxCPM TTS, Qwen3‑Next, Ring‑mini‑2.0, plus stealth Gemini variants (Oceanstone/Oceanreef). Mostly model/eval releases and pricing hints; few voice items beyond VoxCPM.
OpenAI debuts GPT-5‑Codex, an agentic coding model with adaptive think time and multi‑hour autonomy
OpenAI released GPT‑5‑Codex, a GPT‑5 variant trained on real engineering workflows to act as a coding teammate that plans, uses tools, runs tests, and ships changes over multi‑hour runs. It dynamically allocates “think time” (snappy on easy tasks, more deliberate on hard ones) and runs across IDE/CLI/web/cloud.
- Early users show the agent working continuously for 7+ hours, iterating, fixing tests, and landing PRs 7‑hour claim, long‑run demo
- Token‑use distribution shifts toward the long tail on difficult tasks (fewer tokens for easy work, much more for hard problems) per internal usage data adaptive effort
- Real‑world trials highlight strengths (planning, front‑end edits from screenshots) and limitations (occasionally choosing the wrong tool or over‑checking diffs) screenshots to fixes, tool misuse example, slow diff application
- Demand briefly outpaced GPUs; Codex ran ~2× slower than targets before capacity was added and per‑user limits were reset capacity note, degraded speeds, capacity added, limits reset
Google Research ships VaultGemma 1B, a fully differentially‑private LLM with ε≈2 and open weights
VaultGemma is a 1B‑parameter Gemma‑family model trained end‑to‑end with differential privacy, offering formal sequence‑level privacy (ε ≤ 2) and no detectable memorization, while matching older non‑private baselines on classic benchmarks. Weights, code and tech report are public.
- Tech report details DP scaling laws, large‑batch training and privacy/utility tradeoffs; release includes weights on HF/Kaggle tech report link, bench overview
- On ARC‑E/PIQA/BoolQ/etc., performance lands near the GPT‑2 class while meeting strict DP guarantees bench overview
- Google positions VaultGemma as a starting point for private‑by‑design apps (regulated data, PII‑sensitive workloads) tech report link
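VaultGemma's full recipe is in the tech report; the DP-SGD mechanism it builds on can be sketched in a few lines: clip each example's gradient to a fixed norm, then add Gaussian noise scaled to that clip before averaging. A toy numpy illustration with made-up hyperparameters (not Google's implementation):

```python
# Toy DP-SGD step on a linear model: per-example gradient clipping plus
# Gaussian noise, the mechanism behind formal (epsilon, delta) guarantees.
# Hyperparameters are illustrative; VaultGemma's recipe is in its tech report.
import numpy as np

def dp_sgd_step(w, X, y, clip_norm=1.0, noise_multiplier=1.1, lr=0.1, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    per_example_grads = []
    for xi, yi in zip(X, y):                       # per-example gradients
        grad = 2 * (xi @ w - yi) * xi              # squared-error gradient
        norm = np.linalg.norm(grad)
        grad = grad * min(1.0, clip_norm / (norm + 1e-12))   # clip to C
        per_example_grads.append(grad)
    summed = np.sum(per_example_grads, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (summed + noise) / len(X)         # noisy average gradient
    return w - lr * noisy_mean

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(256, 8))
    true_w = rng.normal(size=8)
    y = X @ true_w
    w = np.zeros(8)
    for _ in range(200):
        w = dp_sgd_step(w, X, y, rng=rng)
    print("error vs. true weights:", round(float(np.linalg.norm(w - true_w)), 3))
```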
ByteDance and Tsinghua release HuMo 17B/1.7B human‑centric video models (text/image/audio) under Apache 2.0
HuMo introduces subject‑consistent video generation controllable by text, reference images, and audio, with mask‑guided lipsync and robust subject preservation. The models (17B and 1.7B) ship under Apache 2.0 with paper, weights, and project page.
- Multi‑modal conditioning places video latents at the end of sequence for better identity control; mask predictor guides facial attention without freezing global motion conditioning detail, mask predictor
- Demo set shows stable subjects across scene changes and audio‑aligned lip motion; authors compare to OmniHuman (cleaner speech but different constraints) subject control, comparison
- First‑party links: project page, HF weights, arXiv paper project links
Tencent unveils Hunyuan3D 3.0 with 1536³ geometry and ultra‑HD voxel modeling
Hunyuan3D 3.0 upgrades precision (3×), pushes geometric resolution to 1536³, and introduces 3.6B‑voxel ultra‑HD modeling for lifelike faces and faithful structure reconstruction. It’s available via Hunyuan 3D AI Engine (free tier) and Tencent Cloud API.
- Highlights include layered generation for hidden‑detail recovery, enhanced texture fidelity, and stronger input‑image adherence feature highlights
- A livestream demo is scheduled to showcase production‑grade assets and workflows livestream tease
OpenBMB launches VoxCPM 0.5B: tokenizer‑free TTS with zero‑shot voice cloning and context‑aware prosody
VoxCPM 0.5B is a tokenizer‑free TTS system (MiniCPM‑4 backbone) that generates natural, context‑aware speech and clones voices from short emotional clips. It targets lifelike prosody with a small model footprint and ships demos and code.
- Claims include hyper‑realistic speech, zero‑shot cloning, and natural rhythm/intonation; trained on 1.8M+ hours model overview
- Live demo and repos are available on Hugging Face and GitHub for immediate testing/integration model overview, model overview
Google quietly tests new Gemini variants “Oceanstone” and “Oceanreef” on LM Arena
New Google Gemini‑family models surfaced on LM Arena, with Oceanstone first and now Oceanreef appearing with a September 2025 knowledge cutoff indicator in prompts. This extends Google’s live field‑testing of stealth variants before official release.
- Screens show Oceanstone/Oceanreef self‑identifying as Google‑trained LLMs; Oceanreef responses cite Sept 2025 cutoff (e.g., acknowledging Trump as president) Oceanstone sighting, Oceanreef sighting
- This follows the earlier Oceanstone appearance initial sighting, now expanding to another variant; a separate AI Studio UI leak hints at coming model selection beyond Gemini 2.5 Pro AI Studio model selector
Hunyuan3D Studio publishes end‑to‑end pipeline for game‑ready 3D assets (Unity/Unreal)
Tencent’s Hunyuan3D Studio released a tech report on a modular, production‑oriented pipeline that spans part‑level generation, topology/UV (SeamGPT), PBR textures, and auto‑rigging—optimized for real‑time engines and faster content creation.
- Describes a full stack from data to optimized meshes and textures, targeting rapid, consistent game‑asset output pipeline summary
- ArXiv paper available with system design and component details arXiv paper
Seedream 4 High‑Res surges on LMArena, tying Nano Banana for #1 text‑to‑image and ranking #2 for edits
After being added to LMArena by request, ByteDance’s Seedream 4 High‑Res rapidly accrued votes and now ties Gemini’s Nano Banana at the top of the T2I leaderboard, while placing #2 for image editing.
- Early totals show ~3.7k votes already influencing rankings; edit variants also leapfrog internal baselines leaderboard update, edit rank, arena link
Ring‑mini‑2.0 (16B total, 1.4B active) targets strong logical reasoning at sub‑10B dense quality
Ring‑mini‑2.0 is a lightweight sparse‑activation model claiming dense‑model‑class reasoning under 10B parameters. Authors report competitive scores on LiveCodeBench, AIME 2025, GPQA, and ARC‑AGI‑v1 while keeping output lengths comparable to larger MoEs.
- Configuration: ~16B total parameters with ~1.4B active per token; demoed via quick chat app builds model summary
- Public space available for hands‑on trials HF demo