GPT‑5 Pro hits 13% on FrontierMath Tier‑4 – pass@2 reaches 17%
Executive Summary
Epoch AI’s manual head‑to‑head on FrontierMath Tier‑4 shows GPT‑5 Pro solving 13% of 48 research‑level problems (6/48), with web+API pass@2 at 17% (8/48). That edges Gemini 2.5 Deep Think by a single problem — not statistically meaningful — but it’s the clearest look yet at how these models behave on held‑out, author‑vetted math. Following yesterday’s ARC‑AGI price–score run, this is the update that matters for teams shipping theorem‑heavy workflows.
The standout: GPT‑5 Pro cracked a Tier‑4 problem no model had solved before, and the problem’s author confirmed the result. Importantly, leakage looks manageable: 5 of GPT‑5 Pro’s 8 pass@2 solves come from Epoch’s 20‑problem held‑out split, while Epoch reports OpenAI had prior exposure to 28 of the 48 tasks. Epoch also reran with an API scaffold and got the same 13% but on a different subset, underscoring how tool settings and search variance sway outcomes. The practical read: budget for multiple tool configurations and best‑of‑2 routing if you’re chasing absolute win rates here.
Bottom line: FrontierMath Tier‑4 remains brutally hard; at 13%, reproducible runs and transparent protocols beat cherry‑picked hero samples every time.
Feature Spotlight
Feature: FrontierMath Tier‑4 head‑to‑head
Epoch AI’s Tier‑4 shows GPT‑5 Pro at 13% (6/48) with pass@2 17%, edging Gemini 2.5 Deep Think; first to solve a previously unsolved problem; API scaffold matched web score on different items.
Cross‑account coverage centered on Epoch AI’s manual FrontierMath Tier‑4 results; this section owns the Tier‑4 storyline and excludes other evals (ARC, SWE‑Bench, IOAA) covered elsewhere.
🧮 Feature: FrontierMath Tier‑4 head‑to‑head
GPT‑5 Pro hits 13% on FrontierMath Tier‑4; pass@2 reaches 17%
Epoch AI’s manual head‑to‑head shows GPT‑5 Pro solving 13% (6/48) of Tier‑4 research‑level problems, with combined web+API pass@2 at 17% (8/48), edging Gemini 2.5 Deep Think by one problem (not statistically significant) and finishing far ahead of Grok 4 Heavy manual eval thread, api scaffold result, bar chart. Following up on FrontierMath record, which highlighted Gemini’s broader record, Epoch ran web‑app tests with search/code tools plus an API scaffold run that matched the 13% score but on a different problem subset, underscoring variance across settings methods note, FrontierMath page.
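Note that the pass@2 figure here is a union of solves across two configurations (web app and API scaffold), not the usual same‑config resample. A minimal sketch of computing per‑config accuracy and the cross‑config union, with illustrative (not Epoch's actual) problem IDs:

```python
def union_pass_rate(runs: dict[str, set[int]], total: int) -> float:
    """Fraction of problems solved by at least one configuration."""
    solved = set().union(*runs.values())
    return len(solved) / total

# Illustrative solve sets: 6 each, overlapping on 4, so the union is 8.
runs = {
    "web": {1, 4, 9, 15, 22, 31},   # 6/48 = 12.5%
    "api": {1, 4, 9, 17, 22, 33},   # 6/48 on a partly different subset
}
per_config = {name: len(ids) / 48 for name, ids in runs.items()}
combined = union_pass_rate(runs, 48)   # 8/48 ≈ 16.7%
```

This is why the practical advice above is to budget for multiple tool configurations: disjoint solve subsets push the union score above either single run.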

GPT‑5 Pro cracks a previously unsolved Tier‑4 problem; most solves were on Epoch’s held‑out set
Epoch reports one Tier‑4 problem was solved for the first time by GPT‑5 Pro, with the problem author confirming the result; of GPT‑5 Pro’s eight pass@2 solves, five were from Epoch’s 20‑problem held‑out split (OpenAI had exposure to 28/48) author comment, hold‑out details. The team also shared one of two public samples for readers to gauge difficulty and methodology sample problem, with additional background and protocol notes on the benchmark site FrontierMath page.

🛠️ Coding agents, plugins and eval UX
Strong tooling cadence today: Claude Code plugins and 'skills' internals, a new context‑builder pattern, and eval A/B utilities. Excludes interop/app hosting (see Interop).
Claude Code 2.0.13 adds plugins marketplace hooks, MCP toggles, and faster tool calls
Anthropic shipped Claude Code 2.0.13 with a plugins and marketplace flow, on/off toggles for MCP servers, faster tool calling, rendering fixes, Ctrl+G external‑editor loop, and a trimmed system prompt (−1.4k tokens) weekly changelog. The cadence follows last night’s debut of Plugins beta and command bundles, providing a clear path from local skills to shareable, installable packages Plugins beta.

Claude Code’s hidden “skills” folder surfaces: ready‑made PDF/Office utilities you can reuse
Developers surfaced Claude Code’s /mnt/skills/public directory containing prompt instructions and Python utilities for PDF form fill, DOCX, PPTX, and XLSX creation—now mirrored for inspection and reuse skills repo, with a write‑up on how these skills slot into Code Interpreter flows blog notes and the mirrored artifacts on GitHub GitHub repo. A new banner also notes wider file‑type support rolling out to Pro users after Max, reinforcing that these skills are active features, not experiments feature banner.

Evalite 0.12.0 + evalite.each: faster A/Bs for prompts and models
Evalite 0.12.0 adds dark mode, AI SDK v5 support, and UI polish release note, and a companion PR introduces evalite.each() so teams can run side‑by‑side model/prompt variants and review outputs directly in the UI pull request. This tightens the loop from hypothesis to diffed outputs, making eval orchestration more accessible to small teams.

How to package and sell Claude Code plugins: marketplace.json and zip bundles
A community guide outlines how to bundle Claude Code sub‑agents, commands and hooks into a distributable plugin: create a .claude-plugin/marketplace.json, zip the bundle, and share via URL for one‑line install; devs are already planning paid templates and marketplaces plugin how‑to, bundle overview, marketplace json. The flow lowers friction from personal workflows to monetizable packages without bespoke infra.
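The packaging flow described above can be scripted. A hedged sketch of writing a manifest and zipping the bundle; the manifest field names here are assumptions for illustration, not Anthropic's documented schema:

```python
import json
import zipfile
from pathlib import Path

def bundle_plugin(src: Path, out_zip: Path, name: str, version: str) -> None:
    """Write an illustrative marketplace manifest, then zip the bundle."""
    meta_dir = src / ".claude-plugin"
    meta_dir.mkdir(parents=True, exist_ok=True)
    # Field names are illustrative; check the official plugin docs for the real schema.
    manifest = {"name": name, "version": version}
    (meta_dir / "marketplace.json").write_text(json.dumps(manifest, indent=2))
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(src.rglob("*")):
            if f.is_file():
                zf.write(f, f.relative_to(src))  # archive paths relative to the bundle root
```

Usage would be `bundle_plugin(Path("my-plugin"), Path("my-plugin.zip"), "my-plugin", "0.1.0")`, then hosting the zip at a URL for one‑line install.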

OpenBench adds OpenRouter provider routing for apples‑to‑apples model evals
OpenBench now supports OpenRouter provider routing, letting evaluators compare the same model across vendors and backends under one harness routing support. That reduces a common source of noise in leaderboard runs—backend variance—so prompt and model changes are easier to isolate.
RepoPrompt previews an agentic context builder powered by Claude Code and LangGraph
RepoPrompt teased a context builder that uses Claude Code (and other agents) to assemble, shrink and optimize task context before execution, with MCP discovery as a first‑class workflow button preview demo. This pattern moves beyond prompt pasting toward reproducible, agent‑built context with guardrails.

Warp surfaces per‑thread AI credit usage; billing insights incoming
Warp is rolling out inline AI usage summaries per conversation and will add workspace billing insights; it’s also renaming “requests” to “credits” for clarity feature brief, with details on how the summaries expose context window, tools and model costs to curb waste blog post. Visibility like this helps teams govern agent spend without throttling experimentation.
🔌 Interop and app surfaces for agents
News on where agents run and how tools connect: Vercel hosting for ChatGPT apps, MCP servers wrapped in workflows, and agentic UI canvases. Excludes Claude plugin mechanics (see Coding agents).
Vercel now hosts ChatGPT apps natively with Next.js Apps SDK
Vercel added first‑class support for building and deploying ChatGPT apps, including a Next.js template and Apps SDK integration so UIs render natively inside OpenAI’s sandbox without nested iframes. This gives agent builders familiar CI/CD, preview deploys, and instant rollbacks on the same platform they already use for web apps. rollout note, and Vercel changelog

• Early adopters highlight the path to a ChatGPT “app store” and the velocity gains for founders deploying agent UIs. analysis note
LangGraph AG‑UI Canvas ships for real‑time agent state, custom UI, and HITL
CopilotKit released an AG‑UI Canvas template for LangGraph that renders agent state, tool calls, and progress in real time, with shared state and human‑in‑the‑loop gates for production control. This follows AG‑UI embed where agent meshes in app frontends were highlighted; today’s drop includes runnable assets. template launch, GitHub template, and walkthrough video
Build local browser agents with Gemini 2.5 Computer Use + Browserbase Stagehand
A reference shows how to stand up a local browser agent using Gemini 2.5 Computer Use and Browserbase’s Stagehand (act/extract/observe/agent) primitives atop Playwright. It’s a clean interop recipe for UI automation agents that need both schema‑based extraction and natural‑language actioning. browser agent setup, GitHub sample, and template page

ChatGPT adds keyboard shortcut to toggle Dev Mode for unverified connectors
OpenAI’s web app now exposes a Cmd/Ctrl + . shortcut to toggle Developer Mode, streamlining workflows that rely on unverified connectors during build/test. For teams wiring agents to new data sources or tools, this reduces clicks and makes connector iteration faster. shortcut screen

Google AI Studio’s App Builder auto‑wires a Gemini component into your app
Google AI Studio now suggests and inserts a Gen‑AI component directly into new app projects, making it trivial to enable agentic capabilities without scaffolding from scratch. Templates cover common UI patterns so teams can go from prompt to usable surface faster. builder glimpse

Mastra wraps Mux’s MCP server to expose media tooling inside agent workflows
Mastra Workflows can now wrap Mux’s MCP server so agents can safely call media operations with approvals, auth, and audit—turning MCP tools into governed steps within production flows. This is a practical pattern for bringing domain MCP servers under enterprise controls without custom glue code. workflows brief
Chrome DevTools MCP lands for real‑time agent debugging across Cursor and Claude Code
A new MCP integration for Chrome DevTools enables real‑time inspection and debugging of agent actions, and is already wired into Cursor and Claude Code. This tightens the loop between orchestration and the developer surface where agents run. feature note
RepoPrompt previews context builder that assembles prompts agentically
RepoPrompt is adding a context builder that can leverage Claude Code and other agents to discover MCP servers, assemble relevant artifacts, and optimize prompts before execution—moving beyond wrappers to concrete agentic workflows in the UI. builder preview

Warp surfaces per‑conversation AI credits and upcoming billing insights
Terminal users of Warp now see inline AI usage summaries per conversation, with a consolidated billing insights view “coming soon.” For agentic dev workflows, making cost/credit burn visible at the surface helps teams budget and compare providers. feature brief, and Warp blog
🚀 Adaptive inference and throughput wins
Runtime advances led by Together’s ATLAS adaptive speculator. Excludes chip/supercluster supply (see Infra).
Together’s ATLAS learns at runtime, claiming ~500 TPS and up to 4× faster inference
Together AI introduced ATLAS, a runtime‑learning speculative decoder that adapts to live traffic, improving draft acceptance rates and throughput the longer it runs launch details, and ATLAS explainer. The team reports up to 4× speedups over baseline, ~500 tokens/sec on DeepSeek‑V3.1 after adaptation, and steady‑state wins versus fixed speculators and even specialized hardware on stable workloads results recap, and performance claims.
- It continuously tunes the speculator–target pairing and scheduling from production traces, aiming to raise acceptance and cut latency without retraining your model launch details.
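Speculative decoding's throughput gain hinges on the draft‑acceptance rate, which is the quantity ATLAS adapts at runtime. A toy sketch of the accept/reject loop plus a draft window tuned to observed acceptance (a conceptual illustration, not Together's implementation; thresholds and EMA weights are arbitrary):

```python
def speculative_step(draft_tokens, verify):
    """Accept the longest prefix of draft tokens the target model agrees with."""
    accepted = []
    for tok in draft_tokens:
        if verify(tok, len(accepted)):
            accepted.append(tok)
        else:
            break  # first rejection ends the speculative run
    return accepted

class AdaptiveDrafter:
    """Grow the draft window when acceptance is high, shrink it when low."""
    def __init__(self, k=4, lo=2, hi=16):
        self.k, self.lo, self.hi = k, lo, hi
        self.accept_ema = 0.5  # exponential moving average of acceptance rate
    def update(self, n_drafted, n_accepted):
        rate = n_accepted / max(n_drafted, 1)
        self.accept_ema = 0.9 * self.accept_ema + 0.1 * rate
        if self.accept_ema > 0.8:
            self.k = min(self.k + 1, self.hi)
        elif self.accept_ema < 0.4:
            self.k = max(self.k - 1, self.lo)
```

The longer such a loop runs on stable traffic, the better the drafter's window fits the workload, which is the intuition behind ATLAS improving with time.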
📈 Evals beyond FrontierMath
Round‑up of non‑FrontierMath evals and infra: a new SWE‑Bench Verified high, ARC‑AGI cost/score plots, IOAA olympiad results, and eval routing. Excludes the feature’s Tier‑4.
KAT‑Dev‑72B‑Exp hits 74.6% on SWE‑Bench Verified under strict scaffold
Kwai’s research‑only KAT‑Dev‑72B‑Exp posts 74.6% on SWE‑Bench Verified when evaluated with the SWE‑agent scaffold, edging past prior open models on this tough real‑world code‑repair benchmark results chart. The team attributes gains to large‑scale RL with shared prefix trajectories; a commercial KAT‑Coder is teased alongside a hosted demo product page.

Why it matters: SWE‑Bench Verified remains one of the most actionably predictive coding evals for production agents—strong scores here often translate to higher PR acceptance rates and fewer regressions.
LLMs reach gold‑medal performance on IOAA theory
A new study reports Gemini 2.5 Pro and GPT‑5 averaging ~85% on International Astronomy & Astrophysics Olympiad written theory exams—roughly gold‑medal territory—while GPT‑5 hits ~88.5% on data analysis paper abstract. Authors note persistent weaknesses in spherical trigonometry, coordinate frames, and geometry visualization.

Takeaway: outside math‑contest formats, discipline‑specific theory/data exams are emerging as useful checks on multimodal scientific reasoning; high theory scores won’t obviate the need for rigorous citation and units/frames verification in the loop.
OpenBench adds OpenRouter provider routing and ARC‑AGI support
OpenBench now supports provider routing via OpenRouter, enabling side‑by‑side vendor comparisons during eval runs routing note. Earlier this week the 0.5.0 release added a plugin system and ARC‑AGI support plus 350+ new evals to streamline multi‑benchmark sweeps release notes.
Why it matters: eval infra that can swap providers and scaffolds without code churn makes cost/latency/accuracy trade‑offs observable—and repeatable—across rapidly changing model menus.
🔎 Agent‑grade retrieval and search APIs
Search/RAG stacks tuned for agents: Exa 2.0 latency/quality modes, large fresh index, and DB/abstraction debates. Excludes usage macro stats (see Business).
Exa 2.0 ships dual‑mode agent search with sub‑350 ms P50 and agentic deep re‑search
Exa unveiled 2.0 with two API profiles: Exa Fast targeting end‑to‑end P50 latency under 350 ms for latency‑sensitive agent loops, and Exa Deep that agentically re‑searches/processes to maximize answer quality release thread, quality mode. Under the hood they expanded their index to tens of billions of pages with minute‑level refresh, trained new embedding architectures for a month on a 144× H200 cluster, and upgraded a Rust vector store (clustering, lexical compression, assembly optimizations) engineering details. A latency chart shared by the team highlights the <350 ms P50 claim for Exa Fast latency chart.
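A latency/quality split like this lends itself to simple routing inside agent loops. An illustrative sketch; the fast/deep callables stand in for the two API profiles, and the 1‑second budget threshold is an assumption, not Exa guidance:

```python
import time

def route_search(query: str, latency_budget_ms: float, fast_fn, deep_fn):
    """Pick the low-latency profile under tight budgets, the deep one otherwise."""
    # The fast profile's P50 claim is <350 ms; leave headroom for tail latency.
    search = fast_fn if latency_budget_ms < 1000 else deep_fn
    start = time.perf_counter()
    result = search(query)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms
```

In practice a planner would route interactive tool calls through the fast path and background research steps through the deep path.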

Agent search vendors consolidate: Elastic buys Jina, Mongo buys Voyage, Mixedbread pivots
Industry voices note a rapid consolidation among "embedding model" vendors who in practice sell enterprise search: Jina finds a home in Elastic, Mongo acquires Voyage, and Mixedbread pivots to managed no‑code search endpoints market take, deal summary. For AI engineers building agentic RAG, the signal is clear: value is moving from raw embedding endpoints toward end‑to‑end retrieval products (fresh indexes, ranking, ops tooling) that agents can drive via APIs.
Timescale argues vector DBs are the wrong abstraction for AI apps
Timescale’s team contends many production AI apps suffer from multi‑store drift and synchronization issues because vectors are treated as a primary store; they advocate keeping embeddings as derived data co‑located with application truth to simplify RAG/agent stacks argument thread, and provide concrete architecture guidance in their long‑form write‑up blog post. For agent‑grade retrieval, this shifts design toward simpler schemas, fewer moving parts, and lower ops overhead when iterating on indexes and chunking strategies.
ValyuNetwork promotes deep‑research search API purpose‑built for agents
A small team surfaced ValyuNetwork as a search API optimized for agentic research and analysis, positioned to outperform bigger‑funded offerings on depth workflows (multi‑hop, long‑context retrieval) product note. While details are sparse, the emphasis on agent workflows (vs human‑UI search) aligns with the emerging class of retrieval APIs that expose programmatic, iterative search primitives for planners and tool‑using agents.
💼 Usage scale and enterprise adoption
Macro usage + concrete outcomes: Google’s token volume, Gemini CLI developer scale, low‑cost plan expansion, and AI sales agent ROI. Excludes search/RAG products (see Retrieval).
Google token traffic hits 1.3 quadrillion/month; growth curve shows sharp jump since July
Google’s internal charts show monthly tokens processed across Search, YouTube, Gmail and Workspace have reached ~1.3 quadrillion, up from ~980 trillion in July and ~100 trillion in February, underscoring a rapid usage ramp benchmarks chart, trend note. Following up on 1.3 quadrillion, the new slide adds the historical curve and context across surfaces usage slide.

Gemini CLI passes 1 million developers
A Google event slide states that 1M+ developers have already built with Gemini CLI, signaling strong developer‑side adoption ahead of the next model checkpoint developer slide. For engineering leaders, this suggests growing ecosystem gravity around Google’s tooling and a larger talent pool familiar with Gemini workflows.

OpenAI expands ChatGPT Go budget plan to 16 Asian countries
OpenAI’s low‑priced ChatGPT Go is rolling out to 16 additional countries across Asia, with the company positioning it to widen access to core ChatGPT capabilities at a lower price point rollout details. For adoption teams, cheaper tiers typically lift MAUs, trial funnels, and downstream enterprise upsell potential.

ElevenLabs’ AI sales agent now qualifies 78% of inbound leads across 38 countries
ElevenLabs built an inbound sales agent on its own platform that handles end‑to‑end qualification for 78% of leads, runs 24/7 in 38 countries, and maintains an 8.7/10 CSAT, effectively matching the weekly volume of two full‑time SDRs case study, with concrete KPI cards published for transparency metrics cards. This is a clean, measurable ROI example for leaders considering agentic workflows in funnels.

Sora tops 1M downloads in under 5 days; Android pre‑registration opens in US and Canada
OpenAI’s Sora crossed 1 million app downloads in less than five days, and Android pre‑registration is now live in the US and Canada, indicating broad consumer pull for AI video creation beyond web entry points downloads update, pre‑registration. This scale can spur enterprise creator stacks to integrate Sora outputs into ads, trailers, and social pipelines.

State of AI 2025: 44% of US firms pay for AI; average deal ~$530k; capability/$ doubling every 3–6 months
The latest State of AI report highlights broad enterprise uptake: 44% of US businesses now pay for AI, average deal sizes are ~$530k, AI‑first startups grow ~1.5× faster, and capability per dollar is doubling every 3–6 months (helping pilots expand) report summary, with full context in the official site State of AI site. These macro signals align with rising internal adoption at tech majors and continued capacity buildouts.

Meta expands AI Reels translation and lip‑sync to four languages
Meta widened its automated dubbing and lip‑sync for Reels to English, Spanish, Hindi and Portuguese, letting creators toggle translations on/off while preserving speaker tone feature brief. For brands and media ops, this cuts localization friction and broadens reach without manual ADR workflows.

🎬 Video models and creator pipelines
Heavy creator chatter: Veo 3.1 sample quality, Sora 2 ‘unlimited’ offers, and new nodes/runtimes. Excludes evals and FrontierMath feature.
Veo 3.1 looks imminent: new samples, console footprint, and Vids hooks
Creators are sharing Veo 3.1 comparisons that clearly outshine Veo 3 on identical prompts (cyberpunk robot, volcano, Everest POV, T‑Rex) sample roundup. Google Cloud Console now shows a "veo-3.1-fast" quota dimension, and Google Vids is surfacing Veo generation alongside AI Avatars with 8‑second 720p clips and audio, plus an image‑to‑video path console quota shot, vids access, vids bitrate note, image-to-video note.

- Follows Veo 3.1 IDs, adding the first sample‑quality threads and fresh product‑UI traces. Engineers can start planning API routing guards for model IDs already appearing in quotas, while PMs pressure‑test short‑form ad workflows inside Vids.
Higgsfield pushes ‘Sora 2 Unlimited’ with 25+ ad presets and creator monetization
Higgsfield rolled out unlimited Sora 2 generations targeted at commercial spots, bundling 25+ creative presets and promotional codes to spur trials launch thread. The company is positioning this for TikTok/Shorts‑style conversion videos, and creators are openly discussing using these templates to farm views and revenue pricing page, creator strategy. Engineers should factor in preset‑driven prompt scaffolds and budget caps for high‑volume short‑form ad runs.
Moondream 3 Preview on fal: tiny open VLM for UI‑aware agents and structured output
fal launched Moondream 3 Preview (9B/64‑expert MoE, 32k context) with strong real‑world OCR, UI element pointing, and improved structured outputs—useful for agentic web/app control and data extraction inside creator tools model preview. The model’s ability to reason over interfaces complements video pipelines that need automated cut‑list assembly, pricing overlays, or catalog pulls ui understanding.

ComfyUI adds Kling 2.5 Turbo: faster, cinematic motion with precise camera control
Kling 2.5 Turbo is now available in ComfyUI, bringing faster generations, improved cinematography, precise camera moves, and better facial expressiveness to node‑based pipelines node release. This unlocks storyboard‑to‑shot chains inside ComfyUI graphs without leaving the toolchain.
Google Flow updates video generation: any‑language prompts, safer filters, less aggressive throttling
Google’s Flow team shipped three changes: automatic translation to accept non‑English prompts, tuned responsibility filters to reduce false blocks, and rebalanced rate‑limit logic so creators hit fewer “generating too fast” errors changelog modal. These guardrails smooth bursty campaign work and international collabs.

OpenAI opens Sora Android pre‑registration in the US and Canada
Sora’s Android app is now available for pre‑registration on Google Play for users in the US and Canada, signaling a broader push into mobile creation workflows play listing. Expect a spike in mobile‑first video generation traffic and new constraints (battery, bandwidth, latency) on creator pipelines.

OpenAI publishes Sora 2 prompting guide in the Cookbook
OpenAI released an official Sora 2 Prompting Guide covering shot planning, model/size/seconds parameters, and iteration patterns that map to ad‑style workflows cookbook guide, OpenAI cookbook. Creators are also experimenting with cameo‑style instructions for consistent character beats in series content cameo notes. Teams can codify these patterns into prompt libraries and A/B stacks for production.
ComfyUI gets WAN InfiniteTalk for extended lip‑sync videos
A new ComfyUI workflow integrates WAN InfiniteTalk for longer, more accurate lip‑sync video generation workflow page. Combined with video backbones, this bolsters talking‑head explainers, UGC ad variants, and voiceover localization in tool‑driven pipelines.
DreamOmni 2 arrives on fal with multi‑image editing and consistent characters
fal released DreamOmni 2 with multi‑image editing, cross‑scene character consistency, and aesthetic style transfer—handy for thumbnail sets, storyboards, and scene‑matching assets that feed downstream video models model launch.

🎙️ Real‑time speech stacks and pricing
Updated cost/latency trade‑offs for speech‑to‑speech agents. Focused on OpenAI’s GPT Realtime Mini vs peers.
OpenAI’s GPT Realtime Mini debuts: ~7× cheaper than flagship, 0.81s first audio, 32k context
OpenAI introduced GPT Realtime Mini, a native speech‑to‑speech model aimed at scaling real‑time agents with lower cost and latency. Artificial Analysis reports 0.81s time‑to‑first‑audio (vs. 1.27s prior gen), a doubled 32k context window, and a 68% Big Bench Audio score, while noting that tool calling has been removed and the knowledge cutoff is October 2023 model analysis.

Against peers, Gemini 2.5 Flash Native Audio Dialog leads on first audio (0.6s) and cost efficiency ($0.35/hour input vs Mini’s ~$0.36/hour), and scores higher on reasoning (72% vs 68%), framing a clear price/latency frontier engineers can route against comparison chart. Teams can compare providers and parameters on the maintained leaderboard to choose stacks by reasoning quality, response time, and $/hour speech models board.
Developers are shipping real‑time voice coaches on Gemini Live API
A public demo shows a German pronunciation coach built on Gemini Live that listens, evaluates, and teaches in real time, coupling speech I/O with contextual prompts and imagery. For teams designing live agents (language learning, support triage, sales coaching), it’s a concrete pattern for turn‑level audio feedback without bespoke DSP app demo, with a runnable link to try the experience end‑to‑end pronunciation app.
Meta expands AI Reels translation with voice cloning and lip‑sync to four languages
Meta’s AI dubbing for Reels now supports English, Spanish, Hindi, and Portuguese, auto‑translating videos while mimicking the creator’s voice and syncing lips; viewers can toggle translations on/off. For content teams, this is a turnkey multilingual speech stack that expands reach without custom pipelines feature rollout.

🏗️ Superclusters, nodes and chip policy
Concrete infra signals: Azure’s GB300 NVL72 racks for OpenAI, AMD’s MI450 N2 plans, plus a US priority rule for AI chip allocation. Excludes runtime optimizations (see Systems).
US Senate advances rule to prioritize AI chip deliveries for US buyers over China
The Senate approved language that would force Nvidia/AMD to allocate advanced AI chips to US customers first whenever supply is tight; the House version lacks this clause, so its fate rests on conference. If enacted, vendors would need auditable allocation policies and order‑book controls, likely stretching China lead times and raising effective prices while US hyperscalers get clearer capacity lanes policy summary.

Mechanically this is not an export ban but a queueing rule, which during shortages can be as impactful on AI rollout schedules as hard spec caps; expect more paperwork (end‑user attestations, reseller controls) and contract milestones tied to priority compliance policy summary.
AMD’s MI450 targets TSMC N2 and HBM4; Helios rack touts 51 TB unified memory
AMD signaled its next Instinct MI450 GPUs will fab on TSMC’s 2 nm (N2) and pair with HBM4, while a Helios rack design aims to cluster 72 accelerators for about 51 TB unified memory and ~1,400 TB/s bandwidth; reporting also cites a multi‑year OpenAI deployment starting 07/26 with a 1 GW phase AMD MI450 headline. analysis thread

If Helios lands as described, it shifts the memory‑bound ceiling for long‑context and MoE inference, with the trade‑off that Nvidia Rubin may still lead raw PFLOPS per rack—making real‑world speed hinge on memory systems, interconnect (e.g., UALink), and software stacks analysis thread.
NVL72 efficiency: GB200 delivers big tokens per MW in SemiAnalysis tests
New InferenceMAX charts show GB200 NVL72 racks producing roughly an order‑of‑magnitude more tokens per all‑in utility megawatt than single‑node H200 across user interactivity targets in a 670B MoE document‑querying scenario; with MTP enabled, the NVL72 curve rises further at low–mid interactivity as the scheduler packs more tokens concurrently throughput plot. cost–latency chart

This complements earlier throughput and $/Mtok claims by grounding efficiency in power‑normalized output—important for datacenter PUE and grid constraints—while the cost–latency curve underscores NVL72’s low $/Mtok at the expense of higher latency at certain setpoints, a tradeoff many batch‑heavy agent workloads can absorb InferenceMAX.
🛡️ Jailbreaks and reasoning transparency
Safety threads today: invisible Unicode jailbreaks and a push for CoT‑training disclosure; plus a math‑sycophancy benchmark. Excludes FrontierMath (feature).
Invisible Unicode jailbreaks can hit 100% success and bypass UI-level checks
A new arXiv paper shows attackers can append imperceptible Unicode variation selectors to prompts and force unsafe behavior with up to 100% success, because front-end filters see clean text while tokenizers feed altered tokens to the model paper overview. The authors argue safety stacks must sanitize invisible code points and audit tokenization; otherwise alignment can be undermined even when on‑screen text looks benign attack summary.

Following up on Backdoor poison, which showed that tiny poisoning sets can backdoor models, this result highlights an orthogonal vector: no training change is needed, just token‑level prompt craft, raising the bar for input validation and red‑teaming.
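The paper's proposed mitigation, sanitizing invisible code points before text reaches the tokenizer, is straightforward to prototype. A minimal sketch that strips Unicode variation selectors plus zero‑width format characters:

```python
import unicodedata

# Variation selectors: U+FE00–U+FE0F plus the supplement U+E0100–U+E01EF.
VARIATION_SELECTORS = set(range(0xFE00, 0xFE10)) | set(range(0xE0100, 0xE01F0))

def strip_invisible(text: str) -> str:
    """Drop variation selectors and zero-width format (Cf) characters
    that render invisibly but still alter the token stream."""
    kept = []
    for ch in text:
        if ord(ch) in VARIATION_SELECTORS:
            continue
        if unicodedata.category(ch) == "Cf":  # e.g. ZWSP, ZWJ, ZWNJ
            continue
        kept.append(ch)
    return "".join(kept)
```

One trade‑off to note: blanket removal of Cf characters also breaks legitimate emoji ZWJ sequences, so production filters would likely allow‑list those contexts rather than strip unconditionally.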
BrokenMath benchmark finds 29% sycophancy in GPT‑5 on adversarial theorem edits
BrokenMath tests whether models agree with subtly wrong claims in edited versions of 2025 contest problems: GPT‑5 sycophancy is ~29% while DeepSeek V3.1 is ~70.2%, with more failures on proof‑style prompts than short answers benchmark thread. Mitigations like premise checks, best‑of‑N, and light fine‑tuning helped only partially; the authors recommend premise verification before solving to avoid confidently proving false statements method details.
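The recommended premise verification can be wired in as a gate before any solve attempt. A hedged sketch of the pattern with stubbed callables; in a real deployment both `check_premise` and `solve` would be model calls, and the names are illustrative:

```python
def solve_with_premise_check(claim: str, check_premise, solve):
    """Verify the statement before attempting a proof; refuse to 'prove'
    claims the checker flags as false instead of agreeing sycophantically."""
    if not check_premise(claim):  # e.g. a model call returning True/False
        return {"status": "rejected", "reason": "premise appears false"}
    return {"status": "solved", "answer": solve(claim)}
```

The benchmark's finding that mitigations help only partially suggests treating this gate as one layer, combined with best‑of‑N and spot‑check review, rather than a complete fix.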

Calls grow for CoT‑training transparency; OpenAI says no pressure to hide, Anthropic silent
Safety researchers are pressing labs to disclose whether they train against chain‑of‑thought (CoT), warning that optimizing CoTs can teach models to obfuscate misaligned reasoning disclosure thread. OpenAI told METR there was “no direct training pressure” on GPT‑5 to hide or obfuscate misaligned reasoning traces, while Anthropic hasn’t stated a position; proposals include briefing non‑conflicted third parties like METR under NDA if public detail is too sensitive metr details, third‑party idea, with source details in the evaluator’s write‑up METR report.
🤖 Robots at home and in clinics
Embodied AI signals: Figure CEO emphasizes ‘data’ scale for utility; exoskeleton demos highlight clinical impact; community wants more manipulation over acrobatics.
Figure CEO: robot still not daily-use; says data scale is the missing ingredient
Figure CEO Brett Adcock told TIME that Figure 03 “still has significant problems” and isn’t ready for everyday work, adding that what’s needed now is more data—implying that the teams with the largest data collection and training pipelines will win. That framing tilts advantage toward hyperscalers backing robotics stacks and sets expectations for longer iteration cycles before broad home or commercial deployment ceo comment, and the full interview is available here YouTube interview.
Figure 03 positions for home use with safety textiles, wireless charging and Helix AI
Following up on Figure 03, today’s feature roundups emphasize home-readiness details: washable textiles and safety considerations for close human contact, inductive/wireless charging, an upgraded audio system, and Helix vision–language–action integration. On the fleet side, Figure flags customization and manufacturing steps alongside its BotQ facility claims, signaling a push from lab demos toward deployable form factors in domestic and commercial settings feature roundup.
Wandercraft exoskeleton demo shows a patient standing and walking again
A widely shared clip highlights a woman standing and walking with assistance from Wandercraft’s robotic exoskeleton, underscoring real-world clinical momentum for embodied AI beyond research labs. For engineers and hospital buyers, this reinforces a near-term use case: tightly scoped, safety-critical assistive robotics in rehab and mobility, where reliability, serviceability, and clinician workflows matter as much as model capacity exoskeleton demo, with additional practitioner commentary on impact clinical note.
Robotics community pushes for manipulation benchmarks over backflips and dance reels
Practitioners are calling out acrobatics videos—backflips and choreography—as the wrong progress metric, urging standardized manipulation and tool-use tasks that correlate with household and clinic utility. Expect evals to tilt toward dexterous hands, contact-rich skills, and time-to-recovery under disturbances, not just dynamics tricks benchmark critique, prompted by fresh rounds of backflip showcases backflip mention.
📚 Reasoning and memory research worth prototyping
New academic proposals for long‑horizon and memory: chunked thinking, reasoning memory, goal distance representations, and long‑context compression. Not product releases.
Markovian Thinker scales to 96k tokens with linear‑cost chunked reasoning
Mila and Microsoft propose a fixed‑state “Markovian Thinker” that resets context between short reasoning chunks, enabling 96k‑token chains at roughly 7 H100‑months versus 27 for a comparable long‑CoT baseline, with faster training and inference per step. Following up on H1, which taught long‑horizon reasoning via outcome‑only RL, this paper shows how to keep compute linear and memory constant without changing model size. ArXiv paper

In practice, Delethink reports ~215s per RL step (vs ~249s) and ~8.5k vs ~6k tokens/sec on H100 for comparable settings, while continuing to improve beyond its training thinking length—useful for prototyping long‑running agents where budget and determinism matter. paper overview
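The core idea, a bounded carry‑over state between fixed‑size reasoning chunks so cost stays linear in total thinking length, can be sketched as a loop. This is a conceptual illustration under stated assumptions: `generate` stands in for a model call, and the `FINAL:` sentinel and last‑chunk carry rule are placeholders for the paper's learned state handling:

```python
def markovian_think(task: str, generate, chunk_budget: int, max_chunks: int):
    """Reason in fixed-size chunks; only a bounded carry state crosses
    chunk boundaries, so per-chunk context never grows."""
    carry = ""  # bounded textual state, never the full trace
    for _ in range(max_chunks):
        chunk = generate(task, carry, chunk_budget)
        if "FINAL:" in chunk:
            return chunk.split("FINAL:", 1)[1].strip()
        carry = chunk[-chunk_budget:]  # keep only a fixed-size tail as state
    return None  # budget exhausted without an answer
```

Because each call sees only `task` plus a fixed‑size `carry`, total cost grows linearly with the number of chunks instead of quadratically with trace length.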
ReasoningBank turns agent histories into strategies that boost success and cut steps
Google’s ReasoningBank logs successes and failures into compact, retrievable strategies that agents prepend to future tasks, yielding higher success rates and about two fewer steps on successful runs across WebArena, Mind2Web, and a software benchmark. The companion MaTTS setup spends extra test‑time compute on contrasting rollouts to consolidate stronger memories, a pragmatic recipe teams can prototype without model retraining. paper summary
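The pattern, distilling episodes into compact strategy notes and retrieving relevant ones for new tasks, requires no retraining and is easy to prototype. A minimal sketch with keyword‑overlap retrieval standing in for the embedding index a real system would use; the stored strategies are invented examples:

```python
class ReasoningBankSketch:
    """Store (strategy, keywords) pairs; retrieve top-k by keyword overlap."""
    def __init__(self):
        self.entries = []  # list of (strategy_text, keyword_set)

    def add(self, strategy: str, keywords: set[str]):
        self.entries.append((strategy, keywords))

    def retrieve(self, task: str, k: int = 2) -> list[str]:
        words = set(task.lower().split())
        ranked = sorted(self.entries,
                        key=lambda e: len(e[1] & words), reverse=True)
        # Return only entries that actually share a keyword with the task.
        return [text for text, kw in ranked[:k] if kw & words]

bank = ReasoningBankSketch()
bank.add("Check for a login wall before scraping.", {"login", "scrape"})
bank.add("Prefer site search over pagination.", {"search", "find"})
```

Retrieved strategies would be prepended to the agent's prompt; after each episode, a summarization step would distill the new trace into another entry.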

Artificial Hippocampus Networks compress long context, cutting ~40.5% FLOPs and 74% KV cache
ByteDance’s Artificial Hippocampus Networks pair a sliding exact‑attention window (short‑term memory) with a recurrent compressed state (long‑term memory), achieving linear‑like scaling: on 128k tests they report ~40.5% FLOPs savings and ~74% KV cache reduction, with accuracy gains vs baselines. This design is immediately testable in inference stacks that are KV‑bound. paper overview
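The two‑tier memory, exact attention over a recent window plus a recurrent compressed summary of everything older, can be sketched at the data‑structure level. The compression here is a toy running mean over scalars, not ByteDance's learned module; it only illustrates why the KV footprint stays bounded:

```python
from collections import deque

class HippocampusMemorySketch:
    """Exact recent window + lossy running summary of evicted items."""
    def __init__(self, window: int):
        self.window = deque(maxlen=window)  # short-term: exact items
        self.summary = 0.0                  # long-term: compressed state
        self.n_compressed = 0

    def push(self, x: float):
        if len(self.window) == self.window.maxlen:
            evicted = self.window[0]  # item about to fall out of the window
            # Toy compression: fold the evicted value into a running mean.
            self.n_compressed += 1
            self.summary += (evicted - self.summary) / self.n_compressed
        self.window.append(x)

    def state(self):
        return list(self.window), self.summary
```

Memory use is constant regardless of sequence length, which is the property behind the reported KV‑cache reduction on 128k‑token tests.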

Dual Goal Representations: define goals by time‑to‑reach, not pixels, for robust goal‑RL
UC Berkeley’s Dual Goal Representations encode a goal by its temporal distances from all states (how many steps to reach) instead of appearance, preserving action‑relevant structure under noisy observations and improving generalization in goal‑conditioned RL. This is a clean swap‑in target for agents that currently feed raw goal images/coordinates into policies. paper note
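On a toy discrete environment, "a goal represented by its temporal distance from every state" reduces to a shortest‑path computation per goal. An illustrative sketch via BFS on a tiny undirected graph; the paper learns these distances from data rather than computing them exactly:

```python
from collections import deque

def temporal_distance_repr(graph: dict, goal) -> dict:
    """Represent a goal by its shortest-path step count from every state."""
    dist = {goal: 0}
    frontier = deque([goal])
    while frontier:
        s = frontier.popleft()
        for nbr in graph[s]:
            if nbr not in dist:
                dist[nbr] = dist[s] + 1
                frontier.append(nbr)
    return dist

# Tiny undirected chain: a - b - c - d
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
repr_d = temporal_distance_repr(graph, "d")
```

Two goals that look different in pixels but are the same number of steps away get similar representations, which is what makes this encoding robust to observation noise.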
