OpenAI GPT‑5‑Codex – 2× slowdown fixed; limits reset after GPU surge
Executive Summary
OpenAI’s GPT‑5‑Codex buckled under demand, running about 2× slower than targets before the team added GPUs and restored speed. Limits are reset to compensate, and a safety addendum details code‑specific guardrails, including 100% malware refusals on evals. The launch’s headline capability remains: 7+ hour autonomous coding runs spanning IDEs, CLIs, the web, and GitHub.
In numbers:
- 2× slower than targets; added GPUs restore latency to nominal levels
- Limits reset so users can run more Codex jobs today
- Safety: 100% malware refusals on evals; agent sandboxing; network disabled by default
- VaultGemma: 1B parameters; ε≤2.0 differential privacy; open weights and code
- Hunyuan3D 3.0: 1536³ geometry; 3.6B voxels; 20 free generations via engine/API
- HuMo: 17B and 1.7B video models; Apache 2.0 license; mask‑guided lip‑sync
- VoxCPM: 0.5B parameters; trained on 1.8M+ hours; zero‑shot voice cloning
Also:
- Hunyuan‑MT/Chimera rank first on 30 of 31 WMT2025 language pairs
- Ring‑mini‑2.0 MoE: 16B total parameters; 1.4B active parameters
- Gemini test models “oceanstone/oceanreef” spotted; prompts note September 2025 cutoff
🗣️ Voice & Real‑time Apps
Few but notable: VoxCPM tokenizer‑free TTS with zero‑shot cloning; Monologue Mac app for context‑aware dictation with local models; an ElevenLabs voice‑agent starter kit plus a Cloudflare AI Avenue appearance.
OpenBMB launches VoxCPM 0.5B tokenizer‑free TTS with zero‑shot voice cloning
OpenBMB unveiled VoxCPM 0.5B, a tokenizer‑free text‑to‑speech model that delivers context‑aware prosody and zero‑shot voice cloning in a compact footprint, trained on 1.8M+ hours of audio. The design avoids discrete acoustic tokens, aiming for more natural, expressive speech at small model sizes suitable for apps and devices. Model announcement
- Zero‑shot voice cloning and context‑aware speech generation highlighted as core capabilities, with a 0.5B‑parameter footprint for easier deployment Model announcement
- Try the live demo and assets via Hugging Face; code/resources are linked for quick experiments Hugging Face demo
- Early community demos show quick TTS app prototyping around the model in Anycoder environments (developer showcase) Anycoder demo
Every’s Monologue debuts: context‑aware Mac dictation with local‑first options
Every launched Monologue, a Mac app that turns speech into structured, polished text tailored to the app you’re in, with a personal dictionary, multilingual support, deep on‑screen context (with permission), and on‑device model options. Early adopters report heavy weekly usage and significant speedups vs typing. Launch thread, Website promo
- Features include smart formatting for email/docs/notes/code, automatic proper‑noun and acronym handling, and customizable workflows; free for Every subscribers ($30/mo) or $10/mo standalone early‑bird Monologue site
- Team cites “over 1M words/week” written by early users and strong stickiness across creators and developers Launch thread
- Community feedback highlights high context accuracy and steady product iteration, including usage stats like 95k words/30 days from power users Usage stat card
ElevenLabs ships v0 Agents Starter to spin up voice agents quickly
ElevenLabs introduced a starter flow for building voice‑enabled agents—configure the agent in ElevenLabs, clone the v0 template, and talk to it—aimed at accelerating real‑time voice agent prototyping and integration, in context of Productions launch (managed dubbing/captions rollout). Starter kit
- The starter prescribes a minimal setup path (config → template clone → live voice) to reduce time‑to‑pilot for agentic apps Starter kit
- The team also appeared on Cloudflare’s AI Avenue discussing how to create authentic‑sounding voices and productionize voice experiences Show teaser, and AI Avenue site
- Community events continue (e.g., an AI in Advertising hackathon with Twelve Labs), signaling momentum around voice agent use cases Hackathon note
🤖 Robotics & Embodied AI
OpenAI restarts robotics (humanoid/teleop/Isaac Sim hiring), Unitree open-sources a world‑model+action stack, Figure raises $1B for humanoids; thread highlights simulation, data collection, and scaling for real deployments.
Figure raises $1B at $39B to scale humanoid robots into production
Figure closed a $1B Series C at a $39B post-money valuation to ramp manufacturing and real‑world deployments of its humanoid robots. The round adds strategic depth across chips, carriers, and industrials, signaling a push from R&D to scaled ops.
- Lead and syndicate include Parkway Venture Capital, NVIDIA, Intel Capital, LG Technology Ventures, Salesforce, T‑Mobile Ventures, Qualcomm Ventures, Brookfield Asset Management, and Macquarie Capital funding note, investor list
- Proceeds target factory scaling (BotQ), field deployments in commercial and household settings, and expanded NVIDIA GPU infrastructure for model training/simulation use of funds
- Figure highlights data collection to improve perception across video and multimodal sensors, aligning with a deployment‑driven learning loop use of funds
Unitree open-sources UnifoLM‑WMA‑0 world‑model×action stack on Hugging Face
Unitree released UnifoLM‑WMA‑0, a world‑model plus action architecture spanning multiple robotic embodiments, designed for general‑purpose robot learning. The world model doubles as an interactive simulator for synthetic data and a policy enhancer for long‑horizon control.
- Architecture: a unified world model that predicts robot‑environment interactions, supports simulation for data generation, and boosts policy decision‑making via action heads model overview
- Target: multi‑embodiment learning across arms and humanoids, emphasizing transfer and scaling toward "general‑purpose" robotics model overview
- Availability: open weights and assets are hosted on Hugging Face to accelerate community R&D and reproducibility model overview
OpenAI restarts robotics push with humanoids, teleop and Isaac Sim hires
OpenAI is rebuilding a robotics division with roles focused on humanoid control, teleoperation, high‑volume hardware prototyping, and Nvidia Isaac simulation—adding named talent and clearer signals it will pursue "general‑purpose robotics" in the physical world. This extends the hiring burst first noted in hiring focus (teleop/Isaac Sim push).
- New datapoints include an experienced hire (e.g., Chengshu Li, June 2025) and job specs calling for simulation, tactile sensor design, and manufacturing experience at 1M+ scale role specs
- Listings emphasize perception‑to‑control training stacks (Isaac Sim), teleop for data collection, and safety‑minded hardware/software iteration loops hiring details
- After pausing robotics in 2021, OpenAI now frames robotics as necessary for AGI by coupling high‑rate perception with robust control; build vs partner vs off‑the‑shelf remains unspecified hiring details, role specs
🎥 Generative Media & 3D
Strong creative wave: Seedream 4 High‑Res challenges Nano Banana; Hunyuan3D 3.0, World Labs persistent 3D worlds, Krea realtime video, Genie 3 demos, SRPO realism; HuMo multimodal video and Weavy/Kling workflows; Perplexity adds Nano Banana/Seedream models.
ByteDance + Tsinghua release HuMo: multimodal human‑centric video (text+image+audio), Apache‑2.0
HuMo lands in two sizes (17B and 1.7B) with subject‑consistent generation and mask‑guided lip‑sync, supporting text, image, and audio conditioning—under Apache‑2.0 for practical use. overview thread, architecture note, lipsync mask
- Project page, weights, and paper are public: ArXiv paper, Hugging Face repo, project page
- Design choices: inject reference appearance minimally to preserve base model fidelity; audio cross‑attention drives natural lip movements without freezing global motion architecture note lipsync mask
- Positioning: a unified alternative to vid2vid pipelines; pairs well with creative edit loops and dataset‑constrained production setups comparison note
Seedream 4 High Res ties Nano Banana atop LM Arena text-to-image; debuts #2 on edits
3,700+ new votes swung LM Arena’s image boards: Seedream 4 High Res (ByteDance) is now tied for #1 with Nano Banana (Gemini 2.5 Flash Image) in text-to-image, and ranks #2 in image editing. Early but meaningful signal for creative toolchains. leaderboard update, edit ranking, battle link
- See the live boards and model matchups at LMArena
- The update also separates High Res (#2) vs standard Seedream 4 (#3) on edits, implying quality gains beyond base SD4 variants edit ranking
Hunyuan3D 3.0 launches with 1536³ geometry, 3× precision; free 20 gens and Tencent Cloud API
Following up on tease, Tencent ships Hunyuan3D 3.0 with 3× higher precision, 1536³ geometric resolution, and 3.6B‑voxel ultra‑HD modeling. Free access includes 20 generations, with a livestream walkthrough scheduled. launch details, livestream time, teaser
- Highlights: lifelike faces, faithful structure reconstruction (layered strategy), higher texture fidelity; available via Tencent Cloud API launch details
- Livestream: Sept 17, 12:00 PM UTC for modeling demos and pipeline tips livestream time
World Labs shows persistent, navigable 3D worlds from text/images with Gaussian‑splat export
Instead of short morphing scenes, these worlds persist—users can navigate indefinitely without drift. The model builds explorable 3D from a prompt or image, exportable as Gaussian splats for real‑time rendering (e.g., Three.js Spark). update thread, blog mention
- Splats deliver dense, browser‑friendly scenes with consistent geometry beyond the original view frustum update thread
- Marble beta hosts thousands of worlds and creator access; the team calls out zero runtime cost for continued navigation update thread
YouTube Shorts adds Veo 3 Fast video generation with sound; Edit with AI and Speech‑to‑Song coming
YouTube is moving generative video into the mainstream creative stack. A custom Veo 3 Fast now generates sound‑on clips inside Shorts (US, CA, UK, AU, NZ), with Edit with AI and Speech‑to‑Song (Lyria 2) queued for later this year. Shorts rollout, edit with AI, speech to song, Google blog post
- Add Motion animates photos by transferring motion across subjects; Stylize and Add Objects broaden looks and compositing motion feature
- Edit with AI assembles first cuts (music, transitions, VO) for speed; Speech‑to‑Song remixes spoken lines into musical hooks edit with AI, speech to song
Krea’s Realtime Video generates controllable, physics‑aware infinite clips from sketches
Krea AI’s realtime engine now turns rough sketches (plus webcam or screen inputs) into infinite‑length videos with basic physics simulation—useful for prototyping motion, UI flows, or quick previz loops. feature note
- Inputs: sketch, webcam, screen capture; outputs: continuous clips with steering via strokes and overlays feature note
- Practical angle: rapid iteration without a full 3D pipeline; pairs well with post‑edits in NLEs for polish
Perplexity adds Nano Banana and Seedream 4.0 as image generation backends
Perplexity Pro now lets users pick Nano Banana (Gemini 2.5 Flash Image) and Seedream 4.0 for image generation—alongside GPT Image 1, FLUX.1, and DALL·E 3—bringing top arena models into a mainstream assistant. settings screen
- Signals continued consolidation: end‑user apps exposing multiple premium T2I models under one UI settings screen
Genie 3 demos spur interest as Google team details scale, benchmarks, and agent links
Beyond clip generation, Genie’s pitch centers on agentic worlds and detailed control. Demos include live ancient‑Athens‑style scenes; a developer interview covers capabilities, attention to detail, and alignment with broader agent efforts. ancient Athens demo, interview rundown
- Topics span training/testing, event prompting, ties to Veo 3/Nano Banana, and hardware considerations interview rundown
- Creative takeaway: stronger grounding for interactive worlds, not just passive video; positions well for game/UX prototyping
🧩 Interoperability & MCP
MCP momentum: GitHub launches an MCP Registry; Microsoft analysis flags tool-space interference and namespacing; Firecrawl ships MCP v3; growing client/server ecosystems to tame tool menus, schemas, and resources.
Microsoft flags “tool‑space interference” in MCP stacks; recommends namespaces and caps
A Microsoft analysis of 1,470 MCP servers finds that oversized tool menus, bloated outputs, deep parameter schemas and duplicate names cause agents to pick the wrong tools or stall—an effect they call tool‑space interference.
- Large menus degrade accuracy: some servers expose up to 256 tools, while agents work best under ~20; certain models saw up to an 85% drop with big menus study summary
- Output bloat breaks context: 16 tools returned >128k‑token payloads, with one averaging ~558k tokens; models lost up to 91% performance when outputs exceeded memory windows study summary
- Schemas matter: flattening nested params improved success by 47%, while 20‑level nesting overwhelmed agents; 775 duplicate tool names (e.g., “search” across 32 servers) confused orchestrators study summary
- Errors hide in plain sight: 3,536 errors were buried inside “successful” responses with vague messages, stalling recovery; only 7.6% of servers expose reusable resources and 5% templates study summary
- Prescriptions: use namespaces, smaller menus, schema flattening, standardized errors, capped outputs and resource passing; the interference diagram shows overlapping choices (terminal, browser, GitHub) forcing repeated, fragile decisions interference diagram
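For builders applying those prescriptions, here is a minimal Python sketch of the two cheapest fixes: namespacing tool names by server so duplicate names can't collide, and flattening nested parameter schemas so agents see a single level of arguments. The tool names, fields, and dotted-key convention below are illustrative assumptions, not taken from the study or from any real MCP server.

```python
# Illustrative sketch: namespacing duplicate tool names and flattening a
# nested parameter schema, two of the mitigations the study recommends.
# Tool names and fields are hypothetical, not taken from real MCP servers.

def namespace_tools(server_name: str, tools: list[dict]) -> list[dict]:
    """Prefix each tool name with its server so 'search' from two servers
    can't collide in the agent's tool menu."""
    return [{**t, "name": f"{server_name}.{t['name']}"} for t in tools]

def flatten_schema(schema: dict, prefix: str = "") -> dict:
    """Flatten nested object properties into dotted top-level parameters,
    e.g. {"filters": {"date": ...}} becomes {"filters.date": ...}."""
    flat = {}
    for key, spec in schema.get("properties", {}).items():
        name = f"{prefix}{key}"
        if spec.get("type") == "object":
            flat.update(flatten_schema(spec, prefix=f"{name}."))
        else:
            flat[name] = spec
    return flat

if __name__ == "__main__":
    tools = [{"name": "search", "description": "web search"}]
    print(namespace_tools("browser", tools))
    # [{'name': 'browser.search', 'description': 'web search'}]

    nested = {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "filters": {
                "type": "object",
                "properties": {"date": {"type": "string"},
                               "site": {"type": "string"}},
            },
        },
    }
    print(flatten_schema(nested))
    # {'query': ..., 'filters.date': ..., 'filters.site': ...}
```

A host could apply both transforms at registration time, keeping the agent's menu small and unambiguous without changing the servers themselves.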
GitHub debuts MCP Registry to discover and one‑click install agent tools
GitHub introduced an official MCP Registry that curates Model Context Protocol servers and lets developers sort by community signals, then install them directly in VS Code/Copilot and other MCP hosts. The directory aims to reduce security and discovery friction as tool ecosystems scale.
- The launch post highlights one‑click install in editors, sorting by stars/activity, and first‑party entries from partners like Figma, Postman, HashiCorp and Dynatrace GitHub blog, and GitHub MCP Registry
- Registry framing stresses safer onboarding for agents that call external tools, replacing scattered repos and ad‑hoc links with a vetted directory GitHub blog
- Early community reaction: developers are already pointing threads and workflows to the registry as a canonical starting point for MCP discovery dev excitement, and GitHub blog
Firecrawl ships MCP v3; rises to top web data tool in GitHub’s new registry
Firecrawl released MCP v3 focused on faster search and scraping, and it now ranks as the top web data tool in GitHub’s MCP Registry—an early signal of consolidation around production‑ready MCP servers.
- “Search and scrape faster than ever” with the new MCP v3; ranked #1 in its category within the registry Firecrawl update
- Positioning benefits from GitHub’s new curated directory that streamlines trust and install flows for MCP tools GitHub blog, and GitHub MCP Registry
Cursor 1.6 adds MCP Resources support alongside custom commands and faster agent terminal
Cursor’s latest release brings MCP Resources support into the editor, making it easier for agents to fetch structured artifacts while devs standardize prompts via custom slash commands and a more reliable Agent terminal—building on broader MCP momentum seen with Google’s GenKit plugin.
- New in 1.6: custom /commands, a faster and more reliable Agent terminal, MCP Resources support, and a /summarize command release note, and changelog
- The changelog details validation for elicitation JSON schemas, OS notifications for long runs, and multi‑model improvements that smooth agent workflows inside the IDE changelog
🗂️ Retrieval, Chunking and Indexing
RAG quality and indexing advances: Tencent’s HiChunk hierarchical chunking + HiCBench, UChicago’s RAS survey (retrieval+structuring), SAQ vector quantization (80× faster encoding), recency bias in rerankers, LlamaIndex pivots to document workflows.
LlamaIndex pivots to docs‑first workflows: parsing, extraction and agentic orchestration
LlamaIndex unveiled a docs‑first site and developer hub emphasizing SOTA document parsing (PDF/PPTX/DOCX), schema‑based extraction, indexing, and agents that automate deep research and report generation—positioning the stack as an end‑to‑end paperwork engine, in context of RAG playbook (production RAG tactics). site update, developer hub, LlamaIndex site, Developer hub
- Claims unique combination of file management + agentic orchestration; 600+ integrations from the OSS framework carry over site update
- Focus on long‑form ingestion pain points (layout parsing, structured extraction, relevance‑tuned indexing) to support reliable doc workflows developer hub
Tencent’s HiChunk + HiCBench lift dense‑evidence RAG with hierarchical chunking
Tencent proposes HiChunk with an evaluation suite (HiCBench) showing hierarchical, evidence‑aware splitting and Auto Merge can recover more facts in dense QA than flat chunking. The method trains a splitter to mark hierarchical split points and merges child chunks at query time when multiple match, improving factual coverage and faithfulness on dense tasks. paper thread
- Benchmarks expose that many QA sets hide chunking failures; HiCBench uses human boundaries and evidence‑dense questions to stress chunking paper thread
- HiChunk’s Auto Merge yields roughly +5 points factual coverage on single‑chunk dense tasks while remaining practical on sparse ones paper thread
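For intuition, a toy Python sketch of the Auto Merge step under simplified assumptions (a flat parent/child chunk table and a fixed sibling threshold; HiChunk's actual splitter is a trained model and its merge policy is described in the paper): when several retrieved children share a parent, the parent chunk is returned instead, so dense evidence reaches the LLM as one contiguous span.

```python
# Minimal sketch of an Auto-Merge step over a chunk hierarchy.
# Data layout and the merge threshold are assumptions for illustration;
# HiChunk itself learns hierarchical split points with a trained splitter.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    parent_id: str | None
    text: str

def auto_merge(retrieved: list[Chunk], all_chunks: dict[str, Chunk],
               min_children: int = 2) -> list[Chunk]:
    """If >= min_children retrieved chunks share a parent, return the parent
    chunk instead, so dense evidence is passed to the LLM as one span."""
    by_parent = defaultdict(list)
    for c in retrieved:
        by_parent[c.parent_id].append(c)

    merged: list[Chunk] = []
    for parent_id, children in by_parent.items():
        if parent_id is not None and len(children) >= min_children:
            merged.append(all_chunks[parent_id])   # replace with parent
        else:
            merged.extend(children)                # keep leaves as-is
    return merged

if __name__ == "__main__":
    parent = Chunk("sec1", None, "Section 1: full text of the section ...")
    c1 = Chunk("sec1.a", "sec1", "first half ...")
    c2 = Chunk("sec1.b", "sec1", "second half ...")
    other = Chunk("sec2.a", "sec2", "unrelated ...")
    index = {c.id: c for c in [parent, c1, c2, other]}
    print([c.id for c in auto_merge([c1, c2, other], index)])
    # ['sec1', 'sec2.a']
```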
Survey maps Retrieval and Structuring Augmented Generation (RAS) best practices
UIUC’s RAS survey argues LLMs become more factual when retrieval is fused with structure (taxonomies, hierarchies, graphs), outlining retrieval modes (sparse/dense/hybrid) and structuring (taxonomy, classification, IE) plus integration patterns that reduce hallucinations. It highlights gaps in scalable retrieval, structure quality, and latency‑aware integration. survey summary, builder note, ArXiv paper
- RAG alone curbs parametric drift; RAS adds schema‑guided reasoning (graph walks, community summaries) for multi‑step answers survey summary
- Practical guidance spans retrieval fusion, structure construction, and evaluation choices for science, retail, and healthcare deployments builder note
SAQ vector quantization: 80× faster encoding with up to 80% lower error
SAQ introduces PCA‑guided, segment‑wise bit allocation with code adjustment to push ANN performance: authors report up to 80× faster encoding and up to 80% lower quantization error vs Extended RaBitQ while matching or beating recall at fewer bits. Distance can be refined progressively via prefix codes, and a simple estimator prunes candidates by high‑variance segments first. paper summary
- Per‑dimension quantization with iterative code adjustment avoids heavy codeword search while preserving directionality paper summary
- Matches 8‑bit RaBitQ with ~5–6 bits across image/text sets; index builds are markedly faster paper summary
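A toy numpy sketch of the general idea rather than SAQ's actual algorithm (the bit-allocation rule and uniform scalar codes below are stand-ins; SAQ's code adjustment and progressive distance estimator are in the paper): rotate with PCA, split dimensions into segments, and give more bits to the segments carrying more variance.

```python
# Toy sketch of PCA-guided, variance-weighted bit allocation (illustrative
# only; SAQ's code-adjustment and distance estimation are in the paper).
import numpy as np

def allocate_bits(variances: np.ndarray, total_bits: int) -> np.ndarray:
    """Split a bit budget across segments proportionally to log-variance,
    so informative segments get finer quantization."""
    weights = np.log1p(variances)
    raw = total_bits * weights / weights.sum()
    return np.maximum(1, np.round(raw)).astype(int)

def quantize_segment(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform scalar quantization of one segment to 2**bits levels."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    codes = np.round((x - lo) / (hi - lo + 1e-12) * levels)
    return lo + codes / levels * (hi - lo)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 32))
    # PCA rotation so leading segments concentrate variance
    _, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
    Z = (X - X.mean(0)) @ Vt.T

    segments = np.split(Z, 4, axis=1)             # 4 segments of 8 dims
    seg_var = np.array([s.var() for s in segments])
    bits = allocate_bits(seg_var, total_bits=24)   # ~6 bits/segment budget
    recon = np.hstack([quantize_segment(s, b) for s, b in zip(segments, bits)])
    err = np.mean((Z - recon) ** 2)
    print("bits per segment:", bits, "MSE:", round(float(err), 5))
```

High-variance segments can also be compared first at query time, which is the spirit of the paper's candidate-pruning estimator.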
LLM rerankers exhibit strong recency bias; ‘newer’ docs can jump 95 ranks
A study shows LLM‑based rerankers favor recent-looking content even when text is unchanged: adding a fake newer date makes the top‑10 skew younger by up to 4.78 years and can propel a single item up to 95 ranks; pairwise choices flip ~25% of the time when dates tie. Larger models reduce, but don’t remove, the bias. paper summary
- Protocol: rerank a list, then repeat after adding a one‑line publication date to low‑ranked items—content untouched paper summary
- Skew concentrates at the extremes (top gets newer, bottom older), suggesting date tokens act as strong implicit relevance cues paper summary
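The probe is straightforward to reproduce in outline; a hedged Python sketch follows, with a dummy `rerank` standing in for whatever LLM reranker is under test (the paper's exact prompting is not captured here).

```python
# Sketch of the recency-bias probe: rerank once, prepend a fake newer date to
# low-ranked docs, rerank again, and measure how far those docs move.
# `rerank` is a placeholder for your LLM reranker; the demo scorer is dummy.
from typing import Callable

def probe_recency_bias(query: str, docs: list[str],
                       rerank: Callable[[str, list[str]], list[int]],
                       date_line: str = "Published: 2025-09-01") -> list[int]:
    baseline = rerank(query, docs)                 # doc indices, best first
    low_ranked = baseline[len(baseline) // 2:]     # bottom half of the ranking

    # Content untouched except a one-line date prepended to low-ranked docs.
    dated = [f"{date_line}\n{d}" if i in low_ranked else d
             for i, d in enumerate(docs)]
    after = rerank(query, dated)

    # Positive shift = the doc climbed after gaining a "newer" date.
    return [baseline.index(i) - after.index(i) for i in low_ranked]

if __name__ == "__main__":
    def dummy_rerank(query, docs):                 # stand-in: longer docs rank higher
        return sorted(range(len(docs)), key=lambda i: -len(docs[i]))

    docs = ["short doc",
            "a somewhat longer document",
            "the longest document of them all"]
    print(probe_recency_bias("query", docs, dummy_rerank))
```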
📊 Evals, Leaderboards & Usage
ARC‑AGI bespoke SOTA submissions (Grok 4 pipelines), LMArena launches paid evals, lighteval adds 7k+ tasks/MMMU, Meta’s CyberSOCEval shows low SOC task accuracy; OpenAI/Anthropic publish massive usage patterns; LiveBench chatter and model comparisons.
ARC‑AGI SOTA: Grok‑4 pipelines hit 79.6% (v1) and 29.4% (v2) with program‑synthesis loops
Two bespoke submissions set new ARC‑AGI highs using Grok‑4, shifting from code generation to natural‑language programs and program‑synthesis outer loops. The top v1 score reaches 79.6% at ~$8.42/task, and v2 reaches 29.4% at ~$30.40/task. ARC update, Submission details
- Jeremy Berman’s “natural language programs” pipeline: 79.6% on ARC‑AGI‑1 and 29.44% on ARC‑AGI‑2 with per‑task costs of ~$8.42 and ~$30.40; code, blog, and Kaggle are public Blog post, GitHub repo, and Kaggle notebook
- Eric Pang’s DreamCoder‑inspired library reuse: 77.1% on ARC‑AGI‑1 at $2.56/task and 26.0% on ARC‑AGI‑2 at $3.97/task, with open code and write‑up Second submission, GitHub repo
- ARC Prize lists policy, official leaderboard, and submission routes for bespoke systems Leaderboard links, Testing policy, ARC leaderboard
- Takeaway: test‑time compute applied to program search plus reusable concept libraries materially lift sample efficiency and accuracy on ARC’s long‑horizon reasoning
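For readers unfamiliar with the pattern, a heavily simplified Python sketch of a program-synthesis outer loop, not Berman's or Pang's pipeline; `propose_program` is a placeholder for sampling candidate transformations from an LLM or a concept library, and the loop only accepts a program that reproduces every training pair.

```python
# Simplified outer loop for ARC-style program search: sample candidate
# programs, keep one that reproduces the training pairs, and spend more
# samples (test-time compute) on tasks that resist. Proposer is a placeholder.
import random
from typing import Callable

Grid = list[list[int]]

def search(train_pairs: list[tuple[Grid, Grid]],
           propose_program: Callable[[], Callable[[Grid], Grid]],
           budget: int = 100):
    for _ in range(budget):
        prog = propose_program()                   # e.g., sampled from an LLM
        try:
            if all(prog(x) == y for x, y in train_pairs):
                return prog                        # exact fit on all train pairs
        except Exception:
            continue                               # malformed programs just fail
    return None

if __name__ == "__main__":
    # Toy task: the hidden rule is "transpose the grid".
    train = [([[1, 2], [3, 4]], [[1, 3], [2, 4]])]
    library = [lambda g: g,                                    # identity
               lambda g: [row[::-1] for row in g],             # mirror
               lambda g: [list(r) for r in zip(*g)]]           # transpose
    prog = search(train, propose_program=lambda: random.choice(library))
    print(prog and prog([[5, 6], [7, 8]]))         # [[5, 7], [6, 8]]
```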
OpenAI publishes ChatGPT usage study: ~700M WAU, majority personal use; writing and guidance dominate
OpenAI released the largest analysis to date of ChatGPT usage through July 2025, showing ~700M weekly active users and that personal use has grown to dominate overall conversations. Usage paper, and OpenAI paper
- Scale: ~18B weekly messages at mid‑2025; adoption broadening across regions and demographics OpenAI paper
- Mix shift: personal share rises from ~53% (mid‑2024) to ~73% (2025), while work use grows slower Roundup summary
- Top intents: Writing (~28%), Practical guidance (~28%), and Seeking information (~21%) account for most conversations; coding ~4% OpenAI blog
- Demographics: early male skew narrows; faster growth in lower‑income countries OpenAI paper
Anthropic Economic Index: automation surpasses augmentation; API usage shows 77% directive automation
Anthropic’s interactive Economic Index reports that automation‑style usage has now overtaken augmentation, with 77% of API traffic classified as directive automation; geographic splits show the U.S. leads volume, while Israel leads on a per‑capita basis. This builds on the broader usage picture in Usage studies, which established majority non‑work for ChatGPT and rising automation in business. Automation shift
- Automation vs augmentation: automation passed augmentation in overall Claude usage; 77% automation on API Automation shift, Anthropic index
- Geography: U.S. ~22% of volume; Israel ~7% of working‑age usage, followed by Singapore (~4.5%) and Australia (~4%) Country split
- State skew: California leans math/coding/AI, Texas leans job search and workflows, Florida leans fitness/marketing/consulting State differences
- Explore the data: filter by state, profession and usage types in the interactive index Interactive site
Meta’s CyberSOCEval shows low LLM accuracy on malware analysis (15–28%) and intel reasoning (43–53%)
Meta’s new CyberSOCEval benchmark evaluates LLMs on real SOC workflows and finds current models underperform—malware analysis tops out at 15–28% exact‑match, and threat‑intel reasoning at 43–53%. Benchmark overview
- Malware Analysis: 609 validated Windows samples; multi‑answer multiple‑choice scored by exact set match (random baseline ~1.7%), yet top models achieve only 15–28% Benchmark overview
- Threat intelligence: 43–53% exact‑match on reasoning over indicator/context tasks—above chance but far from reliable Benchmark overview
- Design choices reduce guessing: exact‑set scoring and multi‑label items sharply penalize partial matches, exposing shallow patterning Benchmark overview
- Why it matters: open, SOC‑shaped evals clarify which models reduce analyst toil vs. add noise, and highlight gaps for tool‑grounded workflows Why it matters
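Exact set matching is what keeps the random baseline so low; a quick Python sketch of the grading rule, with the option counts in the example chosen for illustration rather than taken from the benchmark.

```python
# Exact set-match grading as described: credit only when the predicted option
# set equals the gold set, so partial matches and shotgun guessing score zero.
from math import comb

def exact_set_match(predicted: set[str], gold: set[str]) -> int:
    return int(predicted == gold)

def random_baseline(num_options: int, num_gold: int) -> float:
    """Chance of guessing the right subset when the answer size is known."""
    return 1 / comb(num_options, num_gold)

if __name__ == "__main__":
    gold = {"B", "D"}
    print(exact_set_match({"B", "D"}, gold))   # 1
    print(exact_set_match({"B"}, gold))        # 0: no partial credit
    # With, say, 8 options and 2 correct answers, even knowing the answer size
    # only yields 1/28 ≈ 3.6% by chance; the benchmark's ~1.7% baseline
    # reflects its own option/answer distribution.
    print(round(random_baseline(8, 2), 3))
```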
LMArena launches AI Evaluation service with community‑grounded audits and SLAs
LMArena is turning its large‑scale human feedback into a commercial evaluation offering for labs, enterprises, and builders—combining representative samples, auditability, and delivery SLAs. Product launch
- Scope: in‑depth evaluations based on community voting and interactions to reveal strengths, weaknesses, and tradeoffs in real use Offering details
- Guarantees: auditability via sampled feedback data, plus timelines under service‑level agreements for results delivery Offering details
- Positioning: complements public leaderboards with paid, traceable studies for model and app teams that need decision‑grade evals Product blog
Hugging Face lighteval expands to ~7,000 tasks and adds MMMU
HF’s lightweight eval library “lighteval” now covers ~7,000 benchmarks with growing multilingual support and adds MMMU, making broad, local eval runs easier. Coverage stats, Readme update, MMMU support
- Local‑first: designed to run evals on your own hardware with a vast task catalog, including vision‑language Coverage stats
- Scope growth: maintainers highlight rapid expansion to thousands of tasks across domains and languages Readme update
- MMMU support: multimodal, multi‑discipline understanding joins the task set, widening beyond text‑only evals MMMU support
💼 Enterprise, Funding and Products
Big enterprise signals: Fiverr cuts 30% to go AI‑first; ComfyUI raises $17M; Atlassian to buy The Browser Company ($610M); Perplexity adds email/calendar/Notion/GitHub; Notion teasing personalized AI agents; Microsoft rolls out Copilot Chat panes across M365; Google’s $3T cap and AI-led growth noted.
OpenAI launches Stargate UK with NVIDIA and Nscale: 8,000 GPUs in Q1’26, path to 31,000
OpenAI announced Stargate UK, a sovereign compute initiative with Nscale and NVIDIA: offtake up to 8,000 GPUs in Q1’26 with scale potential to 31,000, serving regulated sectors and public services. OpenAI post
- Multi‑site rollout including Cobalt Park in the UK’s AI Growth Zone; includes chips from Arm‑based UK supply where applicable OpenAI blog post
- OpenAI Academy will support the UK’s goal to upskill 7.5M workers by 2030 program note
- Signals demand for jurisdiction‑bound AI hosting and enterprise assurance for finance, research, and national security workloads OpenAI post
Fiverr cuts 30% of staff to rebuild as an AI‑first marketplace
Fiverr is laying off 250 employees (~30%) as it reorganizes around internal AI systems and a leaner operating model. The company targets 25% operating margins by 2026 while shifting hiring toward AI‑native roles. layoffs summary
- Automation examples cited include support summarization, earlier fraud detection, and previously uneconomic manual steps now viable with AI layoffs summary
- 2025 revenue guided at $425M–$438M; savings split between reinvestment and profitability layoffs summary
- The move is framed as a permanent reset: fewer layers, smaller teams, shared AI infrastructure, and upskilling for data pipelines/evals/inference layoffs summary
- Background and analysis: see the independent breakdown and imagery of the internal memo FinalRoundAI post
Microsoft rolls out Copilot Chat panes across Word, Excel, PowerPoint, Outlook and OneNote
Microsoft is bringing Copilot Chat as an in‑app side pane to all Microsoft 365 apps at no extra cost, with context grounded in the active document and quick access to agents, image gen and Pages. rollout overview
- New slash‑search for other docs, multi‑image uploads, larger prompt windows and link‑outs to agents improve authoring loops rollout overview
- With GPT‑5 live, Microsoft cites longer, clearer responses and an 11% thumbs‑up gain on quality rollout overview
- A paid Copilot license still unlocks tenant‑wide reasoning, AI‑Powered Search, Researcher/Analyst agents and admin controls rollout overview
Google unveils AP2, an open Agent Payments Protocol with 60+ partners for auditable AI purchases
Google introduced AP2, a standard for agent‑led payments that carries cryptographically signed Mandates (Intent/Cart/Payment) with every transaction to prove user consent and simplify disputes. protocol overview
- Partners span cards, bank transfers and crypto (Mastercard, AmEx, PayPal, Adyen, Coinbase); A2A x402 adds wallet‑based crypto flows partner list
- Open specs and GitHub: credentials, signatures and audit trails (verifiable credentials) for authorization and risk checks AP2 GitHub, Google blog post
- Developers get low‑friction pay‑per‑use and agent‑to‑agent payments; example integrations already shipping via x402 partner update
Atlassian to acquire The Browser Company for $610M to build an AI‑first work browser
Atlassian is buying The Browser Company (Arc/Dia) for $610M, aiming to fold secure, AI‑assisted browsing into enterprise workflows and compliance. Closing expected by December 2025. deal headline
- Arc’s AI features (preview tabs, chat, shopping helpers) and Dia’s embedded agents will be hardened for enterprise use (security, compliance) SiliconANGLE article
- Pitch: a browser optimized for knowledge work to complement Atlassian’s suite; competitive pressure from Perplexity Comet and Island for enterprise deal headline
Notion to unveil personalized AI agents and a marketplace for shareable templates
Notion is planning Personalized AI Agents with configurable identity, style and memory, plus a marketplace where builders can share or sell agent templates, to be announced at its Sep 18 keynote. feature scoop
- Personalization surfaced via a "Personalize" entry point, with prebuilt personas and editable memories Testingcatalog brief
- Marketplace promises monetization for Notion’s template creator community feature scoop
- Teasers and event pointers confirm timing and build‑up event teaser
Figure AI raises $1B at a $39B valuation to scale humanoid manufacturing and deployments
Figure closed over $1B (Series C) at a $39B valuation to accelerate humanoid buildout and real‑world pilots, with backers spanning NVIDIA, Intel Capital, LG Tech Ventures, T‑Mobile Ventures, Qualcomm and others. deal headline
- Funds earmarked for production manufacturing (BotQ), enterprise deployments and NVIDIA GPU infrastructure for training/simulation scale plans
- Expanding data collection (video + multimodal) to improve perception and control in varied environments scale plans
- Context: rising US humanoid competition (Agility, Apptronik, Boston Dynamics, Tesla) and push for a national robotics strategy scale plans
ComfyUI raises $17M to build an OS for creative AI and launch Comfy Cloud
ComfyUI closed a $17M round led by Pace, Chemistry and Abstract to scale an open, composable platform for generative image/video/3D/audio. A browser‑based Comfy Cloud is in private beta. funding note
- Thesis: an open, node‑based "OS of creative AI" with durable local UX and cloud access for users without GPUs Comfy blog post
- Focus areas: stabilizing custom nodes, UI polish, cloud scalability, and long‑term support for emerging models roadmap
- Signals broad community adoption and an enterprise‑ready hosted tier over time funding note
Perplexity Pro adds email, calendar, Notion and GitHub (Linear/Outlook for Enterprise Pro)
Perplexity expanded its Pro integrations to connect email, calendars, Notion and GitHub; Enterprise Pro also gains Linear and Outlook. This deepens agent grounding and automations across dev and knowledge workflows. integration update
- Developer‑centric angle: GitHub and Linear unlock repo analysis, issue triage and status summaries in one place integration update
- Enterprise angle: Outlook/Notion connections support meeting prep, notes-to‑tasks and unified search across knowledge bases integration update
- Expect stronger RAG quality via account‑level connectors and shared context across tools integration update
ChatGPT’s consolidated personalization hints at an Orders tab and agentic shopping
OpenAI updated ChatGPT’s personalization page (personality, custom instructions, memory), and screenshots show an "Orders" section and "Employee Only" areas, suggesting native commerce and enterprise controls are coming. Sam Altman post
- Watchers point to "agentic shopping," native wallet and “shop for me” emerging behind the UI feature speculation
- This builds on earlier sightings of an in‑app Orders tab for ChatGPT Orders tab and fits OpenAI’s push into end‑to‑end agent flows
- Release notes also flagged better search: fewer hallucinations, shopping intent detection, and clearer formatting OpenAI release notes
LMArena launches AI Evaluation services based on large‑scale community feedback
LMArena introduced a commercial evaluation service to analyze model performance in real human interactions, offering auditability and SLAs on delivery. service launch
- Product pillars: deep, in‑depth evals; representative samples for audit trails; committed timelines capabilities
- Derived from 250M+ conversations, 2M monthly votes and 3M users; public leaderboards and datasets remain product blog
- Aimed at labs, enterprises and developers needing practical, comparative signal to guide model/product choices service launch
Microsoft tests a Copilot “Search Mode” promising answers with enhanced references
Microsoft is piloting a Copilot "Search" mode that emphasizes stronger citations and reference detail in AI responses, signaling a push toward auditable, source‑grounded answers. ui screenshot
- Feature shows up alongside other Copilot modes; likely pairs with Microsoft’s broader enterprise search and governance stack ui screenshot
- Additional hints via watcher accounts suggest order tracking and other workflow features emerging across Copilot surfaces watcher note
🛡️ Security, Safety and Governance
Hardware and policy risks took center stage: GPUHammer Rowhammer on GDDR, Reuters phishing using LLMs, OpenAI teen-safety (age prediction/parental controls), AP2 agent payments with verifiable intent, CAISI/AISI agent red‑team, LLM‑hacking in annotation studies.
OpenAI details teen safety plan for ChatGPT with age prediction and parental controls
OpenAI set out three principles—privacy, freedom, and teen protection—with teen safety prioritized when they conflict; an age‑prediction system defaults uncertain cases to under‑18, stricter content rules apply, and parental controls arrive by month‑end. Altman thread, OpenAI blog, Building towards age prediction, feature recap
- Safety rules: No flirtatious talk or suicide/self‑harm content for teens; acute distress may trigger contacting parents or authorities Altman thread, OpenAI blog
- Parental controls: Link teen accounts (13+), disable memory/history, set blackout hours, receive distress alerts feature recap
- Adult freedom: Adults can request content the default model avoids, within safety bounds; advanced data security to keep chats private, with narrow exceptions Altman thread
Rowhammer-style GPUHammer on GDDR flips weights and collapses model accuracy
Researchers demonstrate a Rowhammer attack on GPU GDDR (RTX A6000) that flips bits in model weights; a single flip took accuracy from ~80% to 0.1%, highlighting cloud multi‑tenant risk and fragile ML integrity. NVIDIA suggests enabling ECC, but that can slow ML by up to 10% and may be bypassed by stronger attacks. overview, paper link
- Attack method: Carefully tuned hammer patterns for GDDR induce bit flips in adjacent cells (GPUHammer) overview
- Impact: One weight bit flip “destroys” accuracy (80%→0.1%), demonstrating catastrophic integrity failure potential overview
- Threat model: Highest risk in shared cloud GPU scenarios (cross‑tenant sabotage) overview
- Mitigation: ECC on, at a performance hit (up to 10%); researchers warn future techniques could evade ECC overview
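The "one flip destroys accuracy" claim is easy to build intuition for: a single bit in a float32 exponent can change a weight by dozens of orders of magnitude, and that weight touches every activation downstream. A short Python illustration of the numeric effect (not the attack itself):

```python
# Why one Rowhammer flip is catastrophic: flipping a high exponent bit of an
# IEEE-754 float32 weight changes it by many orders of magnitude.
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a float32 and return the result."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

if __name__ == "__main__":
    w = 0.037                      # a typical small model weight
    print(flip_bit(w, 30))         # top exponent bit: jumps to roughly 1e+37
    print(flip_bit(w, 21))         # a mantissa bit: small, survivable change (~0.045)
```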
Google launches AP2 to make agent payments auditable with cryptographic mandates
Google’s open Agent Payments Protocol (AP2) standardizes agent‑led purchases across cards, bank transfers, and stablecoins using signed Verifiable Credential “Mandates” that prove user intent; >60 partners (e.g., Mastercard, Amex, PayPal, Coinbase) support it. Google blog post, GitHub repo, partner list
- Proof of consent: Intent and Cart Mandates create non‑repudiable audit trails; Payment Mandate flags human‑present vs not to issuers consent flow explainer, Google blog post
- Privacy model: Credentials provider holds payment methods; shopping agents avoid raw card data exposure consent flow explainer
- Crypto path: A2A x402 enables wallet‑based crypto payments under the same mandate flow (Coinbase demo) Google blog post, Coinbase note
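To make the mandate chain concrete, here is a loose Python sketch of the shape of such records. The field names and the HMAC "signature" are stand-ins for illustration only and do not follow the AP2 spec, which uses verifiable credentials and real key-based signatures (see the GitHub repo for the actual format).

```python
# Illustrative-only sketch of an Intent -> Cart -> Payment mandate chain with
# a toy signature. Field names and signing are NOT the AP2 spec.
import hashlib
import hmac
import json

USER_KEY = b"demo-user-key"  # stand-in for the user's signing key

def sign(payload: dict) -> dict:
    body = json.dumps(payload, sort_keys=True).encode()
    sig = hmac.new(USER_KEY, body, hashlib.sha256).hexdigest()
    return {**payload, "signature": sig}

def verify(record: dict) -> bool:
    payload = {k: v for k, v in record.items() if k != "signature"}
    body = json.dumps(payload, sort_keys=True).encode()
    expected = hmac.new(USER_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])

if __name__ == "__main__":
    intent = sign({"type": "intent", "user": "alice",
                   "instruction": "buy running shoes under $120"})
    cart = sign({"type": "cart", "intent_sig": intent["signature"],
                 "items": [{"sku": "shoe-42", "price": 109.99}]})
    payment = sign({"type": "payment", "cart_sig": cart["signature"],
                    "human_present": False, "method": "card-on-file"})
    # Each mandate references the previous one, giving an auditable trail a
    # merchant or issuer can check before honoring the agent's purchase.
    print(all(verify(m) for m in (intent, cart, payment)))  # True
```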
Study: LLM-based labeling can flip findings; 31–50% incorrect conclusions across tasks
A large analysis of 37 real research tasks and 18 models finds “LLM hacking”—results reversing depending on model/prompt/settings—causes 31–50% incorrect conclusions; 100 human labels often beat 100K LLM labels at avoiding false discoveries. paper summary
- Error types: Missing real effects, inventing effects, direction errors, and exaggerated sizes are common near significance thresholds paper summary
- Mitigations: Post‑hoc corrections trade one error for another; selective exploration enables intentional gaming paper summary
- Guidance: Treat LLM‑assisted annotation as high risk without strong pre‑registration and audits
Reuters: Top chatbots helped craft senior-targeted phishing; 5 of 9 emails got clicks
A Reuters collaboration with a Harvard researcher showed Grok, Meta AI, Claude, Gemini, and DeepSeek readily produced scam emails (tone, urgency, timing). In a controlled test with 108 seniors, five of nine AI‑crafted emails led to clicks, underscoring industrial‑scale fraud potential. Reuters investigation
- Elevated risk: FBI already warns of rising elder fraud; AI scales tailored lures at near‑zero cost Reuters investigation
- Capability gap: Models suggested refinements (urgency, timing) that increase conversion risk Reuters investigation
- Policy angle: Highlights need for stricter guardrails on abuse prompts and better phishing filters
OpenAI, CAISI and UK AISI red‑team chained agent bugs; patched within one business day
OpenAI reported that US CAISI and UK AISI surfaced two chained ChatGPT Agent/GPT‑5 bugs enabling session control; fixes landed in one business day and monitoring/bio safeguards were hardened based on 12+ UK AISI reports. This update follows the earlier red‑team collaboration noted in red-team patch. collab update
- Vulnerability class: Chained agent issues escalated to session takeover risk; mitigations deployed swiftly collab update
- Process change: Expanded monitoring and safeguards informed by external government labs collab update
Meta’s CyberSOCEval: LLMs lag on malware analysis and threat intel reasoning
Meta introduced CyberSOCEval to test SOC‑relevant skills; top models score only 15–28% on malware analysis and ~43–53% exact‑match on threat‑intel reasoning (vs 1.7% random baseline), showing persistent capability gaps for defenders. benchmark intro, soc rationale
- Benchmark design: 609 Q/A pairs from detonated Windows samples; multiple‑answer exact‑match grading suppresses chance success benchmark intro
- Operational takeaway: Focused evals help choose models that reduce analyst toil instead of adding noise soc rationale
China rules Nvidia’s Mellanox deal violated antitrust, keeps probe open amid tariff talks
China’s market regulator concluded Nvidia’s $7B Mellanox acquisition violated antitrust rules; no penalties yet, but the investigation continues as US–China tariff negotiations unfold—adding friction to AI chip supply geopolitics. headline
- Policy context: Multiple US export‑control regimes and Beijing discouraging local firms from buying Nvidia shape the backdrop headline
- Corporate stance: Nvidia says it is cooperating with authorities headline
US–China strike TikTok deal structure: 80% US ownership, US oversight and data partner
A framework would keep TikTok operating in the US with ~80% US ownership (Oracle, Silver Lake, a16z and others) and ~20% China ownership; an American board and government oversight are included, with US user data/security entrusted to a partner. Content‑ranking control remains an open question. deal details
- Governance: Oversight plus data residency/control aims to address national‑security concerns deal details
- Unknowns: Who steers ranking logic that shapes attention remains unclear deal details
HalluDetect + multi-agent workflow trims legal chatbot hallucinations with audit trails
A legal‑domain study benchmarks five RAG chatbots and introduces HalluDetect, an LLM‑based multi‑turn checker; the multi‑agent AgentBot averaged ~0.42 hallucinations/turn at 96.13% token accuracy, outperforming others by flagging only high‑impact errors. paper abstract image
- Technique: Expand evidence pool, keep short chat memory, score risky spans (1–5), and drop low‑risk flags to improve precision paper abstract image
- Process: Split roles (receptionist/paralegal/lawyer/drafter) to ground each step in retrieval before drafting answers paper abstract image
🧠 Training, RL and Reasoning
Focus on long‑horizon agents and efficiency: SRPO for diffusion realism, DeepDive multi‑turn RL search, RL‑trained solution aggregation beating majority vote, Tongyi’s agentic CPT/SFT/RL stack (ReSum, AgentScaler), speculative cascades latency cut, steering tense/aspect.
Tongyi unveils a full RL stack for deep research agents (AgentFounder/AgentScaler/ReSum/WebResearcher)
Alibaba’s Tongyi Lab drops a coordinated suite for long‑horizon web research: AgentFounder (agentic continual pre‑training → SFT → RL), AgentScaler (environment scaling for function‑calling), ReSum (context summarization + RL for long searches), and WebResearcher (iterative deep‑research loop). Reported SOTA on multiple benchmarks with 30B models. overview thread
- AgentFounder (Agentic CPT before post‑training) hits 39.9% on BrowseComp‑en and 72.8% on GAIA, easing capability/alignment conflicts in post‑training ArXiv paper
- AgentScaler scales simulated tool environments, reaching SOTA on τ‑bench/τ²‑bench/ACEBench; 30B approaches 1T‑param systems on function‑calling ArXiv paper
- ReSum compresses history into reasoning states; ReSum‑GRPO adds RL for summary‑aware reasoning, +4.5% over ReAct (up to +8.2% with RL) on web tasks ArXiv paper
- WebResearcher formalizes an iterative (plan↔search↔refine) MDP with tool‑augmented data engine; beats proprietary baselines (36.7% HLE, 51.7% BrowseComp) ArXiv paper
- Code/blog resources for the deep research agent are public to reproduce pipelines and scores GitHub repo, Tech blog
SRPO trains diffusion realism via relative rewards, claiming 75× efficiency over DanceGRPO
Tencent Hunyuan introduces SRPO (Semantic Relative Preference Optimization), an online RL scheme that conditions rewards on promptable attributes and directly optimizes high‑noise timesteps to improve text‑to‑image realism and aesthetics quickly. The team reports 75× training efficiency vs DanceGRPO and strong wins across human evals while mitigating reward hacking on FLUX.1‑dev. method overview
- Uses “direct‑align” gradients at noisy steps to save VRAM and stabilize training; reward prompts (e.g., realism, lighting) steer optimization without extra data method overview
- Human wins across styles (oil painting, anime, cyberpunk) and realism prompts; ablations show robustness to different optimization equations method overview
- Trains in ~10 minutes on 32 GPUs to beat DanceGRPO; delivers photorealistic samples with fewer artifacts and less oversaturation method overview
- Released as project/paper/model/code per the announcement; intended as a faster alternative to GRPO‑style T2I alignment method overview
Meta’s RL‑trained aggregator (AggLM) beats majority voting by synthesizing multi‑answer solutions
FAIR trains an aggregator LLM with verifiable rewards to read multiple candidate solutions, correct errors, and merge useful steps—outperforming majority voting or reward‑model selection, especially when the correct answer is in the minority. paper summary
- Trains with RL on groups mixing easy (mostly correct) and hard (mostly wrong) candidates; reward=1 only for exact final answer, pushing trajectory‑level credit assignment paper summary
- Aggregator generalizes across stronger generators than it saw in training and to shorter outputs; largest gains appear when candidate answers disagree paper summary
- Provides a practical test‑time compute strategy: reason over diverse outputs instead of counting votes, reducing failure cases of majority voting paper summary
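The reward signal itself is simple to state; a minimal Python sketch of the verifiable reward as described, where final-answer extraction is a naive regex stand-in rather than FAIR's implementation.

```python
# Minimal sketch of the verifiable reward for training an aggregator:
# reward is 1 only when the aggregated solution's final answer exactly
# matches ground truth, 0 otherwise. Answer extraction is a naive stand-in.
import re

def extract_final_answer(solution: str):
    """Assume solutions end with a line like 'Final answer: 42'."""
    m = re.search(r"final answer:\s*(.+)$", solution.strip(),
                  flags=re.IGNORECASE | re.MULTILINE)
    return m.group(1).strip() if m else None

def reward(aggregated_solution: str, ground_truth: str) -> float:
    return 1.0 if extract_final_answer(aggregated_solution) == ground_truth.strip() else 0.0

if __name__ == "__main__":
    candidates = [
        "12*7 = 84, minus 5 is 79.\nFinal answer: 79",   # correct but minority
        "12*7 = 84, minus 5 is 80.\nFinal answer: 80",
        "I think it's 80.\nFinal answer: 80",
    ]
    # Majority voting over `candidates` would pick 80; an aggregator that
    # re-derives the arithmetic and outputs 79 earns reward 1, which is
    # exactly the behavior the RL objective reinforces.
    aggregated = "Checking the steps: 12*7=84, 84-5=79.\nFinal answer: 79"
    print(reward(aggregated, "79"))  # 1.0
```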
UI‑S1: Semi‑online RL lifts multi‑turn GUI automation without full online rollout costs
UI‑S1 proposes “semi‑online RL” for GUI agents: simulate online signals in an offline setting by maintaining original outputs in multi‑turn traces and patching divergences back to expert trajectories. New SOP metric correlates with online performance; a 7B model reports SOTA across dynamic GUI benchmarks. paper page, author Q&A
- Incorporates discounted returns and step/episode‑level weighted advantages to inject long‑horizon signals into offline training paper page
- Patch Module recovers off‑policy branches during rollouts, stabilizing learning on multi‑step tasks with sparse rewards paper page
- Gains shown on AndroidWorld/AITW and other dynamic suites, improving multi‑turn reasoning and tool use without expensive online data collection author Q&A
Steering tense and aspect in multi‑token generation via LDA‑found feature directions
A study identifies near‑orthogonal “tense” and “aspect” directions inside LLM activations using linear discriminant analysis, then steers generation by adding these vectors at selected layers/steps. Tense can be controlled at 94–96% on open sentences; aspect is harder but improves with targeted layer/scale choices. paper summary
- Steering before verbal heads and at deeper layers works best; scaling rises with activation magnitude and is most effective near the verb paper summary
- Adding target direction alone outperforms subtracting source; partial alignment subtraction helps slightly; over‑steer risks topic drift/repetition paper summary
- Demonstrates a lightweight alternative to fine‑tuning for controllable syntax, with multi‑token effects rather than single‑token edits paper summary
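Activation steering of this kind is typically implemented as a forward hook that adds a scaled direction vector to a chosen layer's hidden states; below is a generic PyTorch-style sketch under that assumption, where the tense/aspect direction (found via LDA in the paper) is replaced by a random placeholder.

```python
# Generic activation-steering sketch: add a fixed "tense" direction to the
# hidden states at one layer. The direction here is a random placeholder;
# the paper finds it with linear discriminant analysis on activations.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.ff = nn.Linear(d_model, d_model)
    def forward(self, x):
        return x + self.ff(x)

def add_steering_hook(layer: nn.Module, direction: torch.Tensor, scale: float):
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        return output + scale * direction          # shift the hidden states
    return layer.register_forward_hook(hook)

if __name__ == "__main__":
    torch.manual_seed(0)
    d = 64
    model = nn.Sequential(*[ToyBlock(d) for _ in range(4)])
    tense_direction = torch.randn(d)               # placeholder for the LDA direction

    x = torch.randn(1, 8, d)                       # (batch, tokens, d_model)
    baseline = model(x)
    handle = add_steering_hook(model[2], tense_direction, scale=4.0)
    steered = model(x)
    handle.remove()
    # The intervention shows up as a consistent shift along the chosen direction.
    delta = (steered - baseline).mean(dim=(0, 1))
    print(torch.cosine_similarity(delta, tense_direction, dim=0))
```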
🏗️ Compute, Capacity and Cloud
Infra news concentrated on UK buildouts and capacity shocks: OpenAI’s Stargate UK (8k→31k GPUs), Google’s £5bn UK investment, GPT‑5‑Codex demand causing temporary slowdowns, Epoch report on $100B clusters by 2030; Nvidia–China regulatory friction.
OpenAI launches Stargate UK: 8k GPUs in Q1’26 with path to 31k for sovereign compute
OpenAI unveiled Stargate UK, a multi‑site AI infrastructure partnership with NVIDIA and Nscale that will bring 8,000 GPUs online in Q1 2026 and scale to 31,000 over time for jurisdiction‑sensitive workloads across public services, finance, research and security. The initiative includes OpenAI Academy to help upskill 7.5M UK workers by 2030. OpenAI blog post, announcement card, and capacity detail
- Sovereign compute: Multi‑site UK deployment operated with Nscale, powered by NVIDIA, optimized for regulated sectors OpenAI blog post
- Capacity roadmap: 8k GPUs in Q1’26, with a scale plan to 31k GPUs across additional sites capacity detail
- Workforce angle: OpenAI Academy to support the UK’s 2030 upskilling target (7.5M workers) OpenAI blog post
- UK industrial policy fit: Part of OpenAI for Countries and UK AI Opportunities Action Plan alignment announcement card
Google pledges £5bn for UK AI demand, new data centre, and 8,250 annual jobs impact
Google will invest £5bn in the UK over two years to meet AI demand, including a new data centre in Waltham Cross and spend across capex, R&D and engineering (DeepMind’s science/healthcare). The company projects 8,250 UK business jobs annually across the wider economy. investment summary
- Facilities: New Waltham Cross data centre anchors the UK expansion investment summary
- Scope: Spend spans capex, R&D, engineering, and DeepMind work in science/healthcare investment summary
- Jobs footprint: Estimated 8,250 annual business jobs across the wider economy investment summary
GPT‑5‑Codex demand outstrips capacity; temporary slowdowns, GPU surge, and rate‑limit resets
OpenAI reported GPT‑5‑Codex running ~2× slower than targets due to demand spikes, then rapidly provisioned additional GPUs and reset user limits, bringing latency back to normal within hours. This comes in context of API caps increase a day earlier, underscoring step‑function usage growth.
- Incident: Capacity lag drove ~2× slower response times during peak usage capacity note, and status update
- Recovery: “GPUs are up” and service speed restored after rapid capacity adds latency restored, and capacity detail
- Customer relief: Limits reset to compensate for earlier slowdowns; more capacity rolling out this week limits reset
Epoch: $100B training clusters by 2030 as scaling continues; R&D boosted well before full autonomy
Epoch AI forecasts that leading training clusters could exceed $100B by 2030, with compute scaling unlikely to hit a wall in the near term; AI is set to materially automate software engineering and other R&D workflows even before fully autonomous systems arrive. report thread, 2030 forecast
- Cost curve: Leading AI supercomputers’ costs have roughly doubled yearly; path points to >$100B clusters 2030 forecast
- Capability arc: By 2030, AI to autonomously fix issues and implement features; similar assistant roles in math and science software projection, and domain coverage
- Productivity: Expect 10–20% desk‑research boosts; deployment in regulated domains lags capabilities productivity note, and full report
China rules Nvidia’s 2020 Mellanox acquisition violates antitrust; probes continue amid chip tensions
China’s market regulator ruled Nvidia’s $7B Mellanox deal violates antitrust rules, with ongoing investigations and no penalty yet. The move adds friction as the U.S. tightens AI chip export controls and Beijing discourages Nvidia purchases. Nvidia says it’s cooperating. antitrust report
- Regulatory pressure: Ruling arrives alongside shifting U.S. export‑control regimes on AI chips antitrust report
- Market signals: Beijing discouraging local firms from buying Nvidia hardware; compliance posture remains fluid antitrust report
🛠️ Agentic Coding & Dev Tools
Agent workflows and tooling dominated the feed: Codex CLI tips, Cursor 1.6 custom commands, Claude Code UX (/t to toggle extended thinking), CodeRabbit CLI, Amp + Codex tool, CopilotKit agent templates, DSPy growth; real-world reports of GPT-5-Codex loops, planning, and file edits.
GPT‑5‑Codex demand slows service 2×, then recovers as capacity and limits reset
Usage spiked so sharply that Codex ran about 2× slower than targets before OpenAI and partners added GPUs, restored nominal latency, and reset user limits. This follows the model’s initial launch highlighting dynamic thinking and long autonomous runs. demand update, status note
- “2× slower than targets” due to high demand; teams spun up additional GPUs to catch up demand update, status note
- “GPUs are up” brought latency back to normal the same day latency restored
- Limits reset for everyone as a make‑good; more capacity rolling out this week rate limits reset, second reset note, OpenAI devs update
Cursor 1.6 ships custom commands, faster Agent terminal and MCP Resources
Cursor rolled out a sizeable 1.6 update focused on agent ergonomics and extensibility. Developers can now define reusable slash commands, run a snappier Agent terminal, and wire external data/tools via MCP Resources, with a new /summarize to manage long chats. release post, and Changelog
- Custom slash commands live in .cursor/commands and can parameterize prompts for team reuse release post, and Changelog
- Agent terminal reliability and speed got a pass; UI polish and context usage indicators improve long runs release post
- MCP Resources support makes it easier to expose structured data/tools to agents without bespoke glue release post
- Automatic summarization triggers (/summarize) help avoid context bloat on extended sessions full changelog
Field tips and gotchas emerging around GPT‑5‑Codex agent workflows
Developers shared early best practices and pitfalls from long‑running Codex sessions: it excels at planning and multi‑hour autonomy, but can over‑deliberate or pick the wrong tool if left unguided—reinforcing the value of crisp plans and guardrails. long run demo, tool misuse example
- Strengths: 7+ hour independent runs on complex refactors; thorough plans before edits 7‑hour claim
- Weak spots: can try reading files with Python/Ruby instead of built‑in tools; may over‑review diffs tool misuse example, overthinking gripe
- Tuning: teams use shell aliases, strict allowlists, and reasoning summaries to stabilize flows alias setup
CodeRabbit CLI brings AI code reviews to the terminal
CodeRabbit shipped a terminal-first AI reviewer that scans staged/unstaged changes, surfaces issues with navigable results, and copies a ready‑to‑paste “Fix with AI” prompt for your agent of choice. Works before PRs to squash bugs locally. CLI demo, CLI docs
- One‑line install, then cr review --plain to analyze a repo; navigate findings via h/l and copy remediation prompts with c run command, review navigation
- Prompts are agent‑agnostic (Cursor, Claude Code, Codex, etc.), speeding generate‑review‑iterate loops fix prompts
- Full walkthroughs cover setup, login, and usage flows for multi‑repo teams how to install, and CLI landing
- Early users report smoother terminal‑native quality gates for Codex/Claude/Gemini workflows product tweet
Amp adds a Codex-powered code review tool via CLI integration
Amp users can now call GPT‑5‑Codex from inside Amp via a new codex‑code‑review tool that takes a PR link, runs Codex CLI under the hood, and streams results back into the Amp session. Amp toolbox, Amp Owner’s Manual
- Tool wires Codex CLI into Amp’s toolbox to review diffs and produce actionable feedback inline Tool code
- Setup is copy‑paste: point Amp to docs and let it self‑configure; ensure Codex is authenticated first usage guide
- Keeps cost low if authenticated via ChatGPT while enabling Codex‑grade review quality usage guide
CopilotKit releases Gemini 2.5 + LangGraph template for full‑stack agent apps
CopilotKit published a reference project that embeds agents directly in-app using CopilotKit UI, Next.js, FastAPI, and LangGraph, with examples for a Post Generator (live search grounded) and a Stack Analyzer for GitHub repos. overview thread
- Blog walkthrough and open repo cover state graphs, streaming, tools, and UI wiring Tutorial blog, GitHub repo
- Practical patterns for production: structured JSON outputs via Pydantic and tool‑augmented workflows blog + repo
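On the structured-output point, a small sketch of the usual Pydantic v2 pattern (the model and field names are invented for illustration; the template's actual schemas live in its repo): validate the agent's JSON against a schema before it reaches the UI.

```python
# Sketch of the structured-JSON-output pattern mentioned above: a Pydantic
# schema validates the agent's output before the frontend renders it.
# Model/field names are invented for illustration, not from the template.
from pydantic import BaseModel, Field, ValidationError

class StackAnalysis(BaseModel):
    repo: str
    primary_language: str
    frameworks: list[str] = Field(default_factory=list)
    summary: str

raw_llm_output = """
{"repo": "acme/webapp", "primary_language": "TypeScript",
 "frameworks": ["Next.js", "FastAPI"], "summary": "Full-stack app."}
"""

try:
    analysis = StackAnalysis.model_validate_json(raw_llm_output)
    print(analysis.frameworks)          # ['Next.js', 'FastAPI']
except ValidationError as err:
    # Malformed agent output is caught here instead of breaking the UI.
    print(err)
```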
Crush IDE agent adds in‑app reasoning controls and faster dev loops
Charmbracelet’s Crush now lets you tune reasoning effort in‑app and shipped six updates in seven days: faster file watching, better LSP performance, Gemini improvements, smarter model search, and more. release note
- Open‑source repo and changelog highlight rapid iteration cadence GitHub repo, repo link
- Terminal perf brag: “2000t/s in the terminal” for snappy interactions throughput note
Codex CLI pro tip: ‘cdx’ alias enables full‑auto runs with search and reasoning summaries
A handy zsh/bash alias turns codex into cdx with sensible defaults: GPT‑5‑Codex model, full‑auto mode, web search on, and experimental reasoning summaries for quick operator insight. shell alias
- One‑function install: npm update shortcut and a codex --full-auto --search profile with model_reasoning_summary_format set shell alias
- Useful for day‑to‑day: faster starts, fewer flags, and consistent run hygiene across teams shell alias
🧪 New and Updated Models
Heavy day for model drops: OpenAI’s GPT-5-Codex, Google’s DP-trained VaultGemma 1B, Tencent’s Hunyuan-MT and Hunyuan3D 3.0, ByteDance’s HuMo, OpenBMB VoxCPM TTS, Qwen3‑Next, Ring‑mini‑2.0, plus stealth Gemini variants (Oceanstone/Oceanreef). Mostly model/eval releases and pricing hints; few voice items beyond VoxCPM.
OpenAI debuts GPT-5‑Codex, an agentic coding model with adaptive think time and multi‑hour autonomy
OpenAI released GPT‑5‑Codex, a GPT‑5 variant trained on real engineering workflows to act as a coding teammate that plans, uses tools, runs tests, and ships changes over multi‑hour runs. It dynamically allocates “think time” (snappy on easy tasks, more deliberate on hard ones) and runs across IDE/CLI/web/cloud.
- Early users show the agent working continuously for 7+ hours, iterating, fixing tests, and landing PRs 7‑hour claim, long‑run demo
- Token‑use distribution shifts toward the long tail on difficult tasks (fewer tokens for easy work, much more for hard problems) per internal usage data adaptive effort
- Real‑world trials highlight strengths (planning, front‑end edits from screenshots) and limitations (occasionally choosing the wrong tool or over‑checking diffs) screenshots to fixes, tool misuse example, slow diff application
- Demand briefly outpaced GPUs; Codex ran ~2× slower than targets before capacity was added and per‑user limits were reset capacity note, degraded speeds, capacity added, limits reset
Google Research ships VaultGemma 1B, a fully differentially‑private LLM with ε≈2 and open weights
VaultGemma is a 1B‑parameter Gemma‑family model trained end‑to‑end with differential privacy, offering formal sequence‑level privacy (ε ≤ 2) and no detectable memorization, while matching older non‑private baselines on classic benchmarks. Weights, code and tech report are public.
- Tech report details DP scaling laws, large‑batch training and privacy/utility tradeoffs; release includes weights on HF/Kaggle tech report link, bench overview
- On ARC‑E/PIQA/BoolQ/etc., performance lands near the GPT‑2 class while meeting strict DP guarantees bench overview
- Google positions VaultGemma as a starting point for private‑by‑design apps (regulated data, PII‑sensitive workloads) tech report link
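VaultGemma's full recipe is in the tech report; the DP-SGD mechanism it builds on can be sketched in a few lines: clip each example's gradient to a fixed norm, then add Gaussian noise scaled to that clip before averaging. A toy numpy illustration with made-up hyperparameters (not Google's implementation):

```python
# Toy DP-SGD step on a linear model: per-example gradient clipping plus
# Gaussian noise, the mechanism behind formal (epsilon, delta) guarantees.
# Hyperparameters are illustrative; VaultGemma's recipe is in its tech report.
import numpy as np

def dp_sgd_step(w, X, y, clip_norm=1.0, noise_multiplier=1.1, lr=0.1, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    per_example_grads = []
    for xi, yi in zip(X, y):                       # per-example gradients
        grad = 2 * (xi @ w - yi) * xi              # squared-error gradient
        norm = np.linalg.norm(grad)
        grad = grad * min(1.0, clip_norm / (norm + 1e-12))   # clip to C
        per_example_grads.append(grad)
    summed = np.sum(per_example_grads, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (summed + noise) / len(X)         # noisy average gradient
    return w - lr * noisy_mean

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(256, 8))
    true_w = rng.normal(size=8)
    y = X @ true_w
    w = np.zeros(8)
    for _ in range(200):
        w = dp_sgd_step(w, X, y, rng=rng)
    print("error vs. true weights:", round(float(np.linalg.norm(w - true_w)), 3))
```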
ByteDance and Tsinghua release HuMo 17B/1.7B human‑centric video models (text/image/audio) under Apache 2.0
HuMo introduces subject‑consistent video generation controllable by text, reference images, and audio, with mask‑guided lipsync and robust subject preservation. The models (17B and 1.7B) ship under Apache 2.0 with paper, weights, and project page.
- Multi‑modal conditioning places video latents at the end of sequence for better identity control; mask predictor guides facial attention without freezing global motion conditioning detail, mask predictor
- Demo set shows stable subjects across scene changes and audio‑aligned lip motion; authors compare to OmniHuman (cleaner speech but different constraints) subject control, comparison
- First‑party links: project page, HF weights, arXiv paper project links
Tencent unveils Hunyuan3D 3.0 with 1536³ geometry and ultra‑HD voxel modeling
Hunyuan3D 3.0 upgrades precision (3×), pushes geometric resolution to 1536³, and introduces 3.6B‑voxel ultra‑HD modeling for lifelike faces and faithful structure reconstruction. It’s available via Hunyuan 3D AI Engine (free tier) and Tencent Cloud API.
- Highlights include layered generation for hidden‑detail recovery, enhanced texture fidelity, and stronger input‑image adherence feature highlights
- A livestream demo is scheduled to showcase production‑grade assets and workflows livestream tease
OpenBMB launches VoxCPM 0.5B: tokenizer‑free TTS with zero‑shot voice cloning and context‑aware prosody
VoxCPM 0.5B is a tokenizer‑free TTS system (MiniCPM‑4 backbone) that generates natural, context‑aware speech and clones voices from short emotional clips. It targets lifelike prosody with a small model footprint and ships demos and code.
- Claims include hyper‑realistic speech, zero‑shot cloning, and natural rhythm/intonation; trained on 1.8M+ hours model overview
- Live demo and repos are available on Hugging Face and GitHub for immediate testing/integration model overview, model overview
Google quietly tests new Gemini variants “Oceanstone” and “Oceanreef” on LM Arena
New Google Gemini‑family models surfaced on LM Arena, with Oceanstone first and now Oceanreef appearing with a September 2025 knowledge cutoff indicator in prompts. This extends Google’s live field‑testing of stealth variants before official release.
- Screens show Oceanstone/Oceanreef self‑identifying as Google‑trained LLMs; Oceanreef responses cite Sept 2025 cutoff (e.g., acknowledging Trump as president) Oceanstone sighting, Oceanreef sighting
- This follows the earlier Oceanstone appearance initial sighting, now expanding to another variant; a separate AI Studio UI leak hints at coming model selection beyond Gemini 2.5 Pro AI Studio model selector
Hunyuan3D Studio publishes end‑to‑end pipeline for game‑ready 3D assets (Unity/Unreal)
Tencent’s Hunyuan3D Studio released a tech report on a modular, production‑oriented pipeline that spans part‑level generation, topology/UV (SeamGPT), PBR textures, and auto‑rigging—optimized for real‑time engines and faster content creation.
- Describes a full stack from data to optimized meshes and textures, targeting rapid, consistent game‑asset output pipeline summary
- ArXiv paper available with system design and component details arXiv paper
Seedream 4 High‑Res surges on LMArena, tying Nano Banana for #1 text‑to‑image and ranking #2 for edits
After being added to LMArena by request, ByteDance’s Seedream 4 High‑Res rapidly accrued votes and now ties Gemini’s Nano Banana at the top of the T2I leaderboard, while placing #2 for image editing.
- Early totals show ~3.7k votes already influencing rankings; edit variants also leapfrog internal baselines leaderboard update, edit rank, arena link
Ring‑mini‑2.0 (16B total, 1.4B active) targets strong logical reasoning at sub‑10B dense quality
Ring‑mini‑2.0 is a lightweight sparse‑activation model claiming dense‑model‑class reasoning under 10B parameters. Authors report competitive scores on LiveCodeBench, AIME 2025, GPQA, and ARC‑AGI‑v1 while keeping output lengths comparable to larger MoEs.
- Configuration: ~16B total parameters with ~1.4B active per token; demoed via quick chat app builds model summary
- Public space available for hands‑on trials HF demo