OpenAI GPT‑5‑Codex – 400k context and $1.25/M input pricing roll out
Executive Summary
OpenAI’s GPT‑5‑Codex is now everywhere developers actually work: the Responses API, Codex CLI 0.40, Cursor, Cline, Factory, OpenRouter, and more. It pairs a 400k‑token context with agent‑tuned prompting and clean pricing—$1.25/M input and $10/M output—aiming to spend fewer tokens on trivial edits and scale thinking for hard refactors. Early compile‑and‑fix runs are promising, with GPT‑5‑Codex‑High tying for the top slot on messy real‑world builds.
In numbers:
- 400k tokens context; $1.25/M input; $10/M output via Responses API
- Codex CLI 0.40: /review and /status; auto‑compact triggers near 220k tokens
- Cursor Agent enabled; Figma→code demos; multi‑file edits with adaptive reasoning
- Cline integration: 400k context; “thinking slider” tunes variable reasoning and token spend
- Factory web/CLI: long‑running tasks and autonomous PRs wired to GPT‑5‑Codex
- OpenRouter: model live; Responses API alpha adds tools/search; $10 feedback credit
- CompileBench: GPT‑5‑Codex‑High 91% pass@1 and 93% pass@3 under dependency hell
Also:
- vLLM turns on full CUDA graphs by default; lower latency for short prompts
- Ollama ships a new scheduler; fewer OOMs and better multi‑GPU utilization
Feature Spotlight
Feature: GPT‑5‑Codex lands across IDEs and APIs
GPT‑5‑Codex arrives with 400k ctx, adaptive reasoning and API/IDE support—standardizing agentic coding patterns and raising the bar on long‑running autonomous dev tasks.
Cross‑account launch: OpenAI’s GPT‑5‑Codex rolls out to the Responses API, Codex CLI 0.40, Cursor, Cline, Windsurf, Factory, OpenRouter and more—shaping agentic coding workflows today.
- OpenAI: GPT‑5‑Codex added to Responses API; $1.25/M input, $10/M output; 400k context; Sept‑2024 cutoff; “less is more” prompting guide published
- Codex CLI 0.40: default model gpt‑5‑codex; auto‑compaction at 220k; new /review commands; /status shows rate limits
- Cursor: GPT‑5‑Codex available for Agent; live shipping demos; Figma→code flows highlighted
- Cline: GPT‑5‑Codex live; thinking slider synergy; blog notes 93% fewer tokens on simple tasks, more on complex
- Factory AI: "Droid" tuned for GPT‑5‑Codex; strengths in long‑running tasks and autonomous PRs across web/CLI
- OpenRouter: GPT‑5‑Codex available via OpenAI‑compatible API; Responses API alpha credit promo
📑 Table of Contents
🧑💻 Feature: GPT‑5‑Codex lands across IDEs and APIs
Cross‑account launch: OpenAI’s GPT‑5‑Codex rolls out to the Responses API, Codex CLI 0.40, Cursor, Cline, Windsurf, Factory, OpenRouter and more—shaping agentic coding workflows today.
OpenAI adds GPT‑5‑Codex to Responses API with 400k context and new prompting guide
GPT‑5‑Codex is now callable via the Responses API, bringing an agent‑optimized coding model (Sept‑2024 cutoff, 400k context) at $1.25/M input and $10/M output—built to use fewer tokens on simple edits and scale reasoning on hard refactors, following up on Responses API.
- Docs note immediate API access and model behavior details (see API update and model docs).
- OpenAI published a concise “less is more” prompting guide tailored to Codex workflows (prompting guide, and GPT‑5 Codex prompting guide).
- Pricing and limits confirmed by community trackers (pricing recap) and the help center release notes (release notes, and OpenAI docs).
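For teams wiring this up, here is a minimal call sketch assuming the standard OpenAI Python SDK; the model id is the one announced above, while the instructions and prompt text are purely illustrative.

```python
# Minimal sketch: GPT-5-Codex via the Responses API with the official OpenAI
# Python SDK. The model id comes from the announcement; prompt text is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5-codex",
    instructions="You are a coding agent. Keep edits minimal and explain the diff briefly.",
    input=(
        "Refactor this function to avoid quadratic behavior:\n\n"
        "def dedupe(xs):\n"
        "    out = []\n"
        "    for x in xs:\n"
        "        if x not in out:\n"
        "            out.append(x)\n"
        "    return out"
    ),
)
print(response.output_text)  # concatenated text from the model's output items
```

Per the “less is more” guide linked above, short instructions tend to serve this model better than heavy scaffolding.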
Codex CLI 0.40 ships: gpt‑5‑codex default, /review commands, /status limits, auto‑compact at 220k
The Codex CLI rolled to v0.40, making gpt‑5‑codex the default model, adding a new /review flow (uncommitted, commit, or PR‑style), exposing rate limits via /status, and enabling auto‑compaction at ~220k tokens to keep long sessions moving.
- Release captured by early adopters with quick changelog screenshots (release ping, release notes).
- Maintainer confirmed specifics in the GitHub tag (GitHub notes).
- New review shortcuts highlighted by users integrating with workflow scripts (review overview).
Cursor turns on GPT‑5‑Codex Agent; live Figma→code demos roll out
Cursor enabled GPT‑5‑Codex in its Agent, with live sessions showing design‑to‑code flows and multi‑file edits guided by the model’s adaptive reasoning.
- Availability confirmed by the Cursor team and builders shipping features in real time (Cursor enable, live demo).
- Figma→code walkthroughs spotlighted as a flagship use case (design to code).
- Cursor’s updated site pitches the Agent as a human‑AI programmer for end‑to‑end tasks (site homepage, and Cursor site).
Cline adds GPT‑5‑Codex with 400k context and variable thinking tuned to a “thinking slider”
Cline integrated GPT‑5‑Codex and reports strong synergy with its thinking‑effort slider—using far fewer tokens on easy edits while expanding reasoning on complex changes.
- Launch announcement and write‑up emphasize 400k context and adaptive reasoning behavior (cline launch, details, and cline blog).
- Positioned for long‑running builds, debugging, and structured code reviews rather than chatty assistants (cline launch).
OpenRouter adds GPT‑5‑Codex and a Responses API alpha with credit promo
OpenRouter made GPT‑5‑Codex available via its OpenAI‑compatible API and launched an experimental Responses API (reasoning, tool calls, web search), offering credits for early feedback.
- Model card and app integrations listed (Cline, Goose, Kilo, Roo, Quests) (model launch, apps list, and GPT‑5 Codex page).
- Responses API alpha details and credit offer published (alpha post, credit promo, and Responses API docs).
Factory’s Droid tunes for GPT‑5‑Codex: long‑running tasks and autonomous PRs across web/CLI
Factory AI enabled GPT‑5‑Codex across its web app and CLI, citing strengths in long‑running tasks, autonomous pull requests, and adaptive reasoning for quick Q&A.
- Rollout and docs shared with quickstart pointers (Factory update, and Factory docs).
Early impressions of GPT‑5‑Codex: fast, self‑checking and strong at messy OSS builds
Developers report the model is “fast as hell,” good at catching its own mistakes, and competitive on realistic compile‑and‑fix tasks from legacy repos.
- Speed and feel from first runs (speed take).
- Self‑checking/correction praised in early trials (self‑check note).
- External benchmark context: CompileBench shows gpt‑5‑codex‑high tied at the top on pass@1/pass@3 under dependency hell (benchmark link).
Conductor turns on Codex with revived Big Terminal Mode for agentic coding
Conductor enabled Codex and brought back Big Terminal Mode, giving agents more screen real estate and context while they build and test in the terminal.
- Rollout details and how to enable shared by the team (Conductor update, enable steps).
🏗️ Stargate expansion and the 10‑GW era
OpenAI+Oracle+SoftBank add 5 US Stargate sites (~7 GW, >$400B over 3 years), while Sam Altman outlines a factory to produce 1 GW/week. NVIDIA offers reassurance on allocation parity. Non‑AI macro excluded.
Huawei’s cluster‑first play: 8,192‑chip SuperPod roadmap and 4 Tbps links by 2028
Unable to match NVIDIA per‑chip speeds, Huawei pitches scaling via SuperPods (Ascend 950) linked by a UnifiedBus fabric so many modest chips act as one accelerator. A 2026 target details 8,192 chips, 6.7× more compute, 15× memory, and 62× bandwidth vs NVL144; 4 Tbps chip links are targeted by 2028. cluster plan
- Constraints: loss of TSMC access, stalled 7nm/5nm yields, and manufacturing bottlenecks temper ambitions despite the clustering strategy. cluster plan
- Highlights alternate routes to AI scale where cutting‑edge foundry access is limited, reinforcing network/fabric as a competitive lever.
🧬 New models: Qwen3‑Max, Qwen3‑VL 235B, Coder‑Plus & Guard
Heavy Qwen day: flagship Qwen3‑Max (instruct/thinking), open‑weights Qwen3‑VL‑235B‑A22B (Instruct/Thinking), Qwen3‑Coder‑Plus upgrade, Qwen3‑LiveTranslate‑Flash, plus LongCat‑Flash‑Thinking. Excludes GPT‑5‑Codex (feature).
Qwen3‑Max debuts: Instruct rivals top coding/math models, Thinking hits 100% on AIME25 and HMMT25
Alibaba’s new flagship Qwen3‑Max lands in two flavors: Instruct (no thinking) that challenges leaders on SWE‑Bench/Tau2/LiveCodeBench, and a Thinking mode that posts perfect scores on AIME25 and HMMT25 with Python. Following up on 30B omni, which unified any‑to‑any I/O, this extends the stack with a frontier‑class text model for code and reasoning. release thread scores chart Qwen Chat Model Studio Qwen blog post
- Instruct highlights: 69.6 on SWE‑Bench Verified (100‑turn OpenHands), 69.0 LiveCodeBench v6, 74.8 τ²‑Bench (weighted). release thread
- Thinking highlights: 100.0 on AIME25 and HMMT25 (Python allowed), 85.4 GPQA; designed for heavy tool use. scores chart
- Available now via Qwen Chat and Alibaba Cloud Model Studio API; no preview gating. Qwen Chat Model Studio
- Positioning: complements Qwen’s omni/multimodal push with a pure‑text flagship tuned for long‑horizon coding and math. release thread
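For API access, Model Studio exposes an OpenAI‑compatible endpoint, so a call sketch like the one below should be close; the base_url (international region) and the model id ("qwen3-max") are assumptions to verify against the Model Studio catalog.

```python
# Hedged sketch: calling Qwen3-Max through Alibaba Cloud Model Studio's
# OpenAI-compatible endpoint. The base_url and model id are assumptions;
# check the Model Studio docs for your region and the current snapshot name.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen3-max",  # assumed id; instruct vs thinking variants may use different names
    messages=[{"role": "user", "content": "Prove that the product of two odd integers is odd."}],
)
print(resp.choices[0].message.content)
```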
Qwen3‑VL‑235B‑A22B open‑sourced (Apache‑2.0) with Visual Agent, Visual Coding and 256k→1M context
Alibaba released Qwen3‑VL‑235B‑A22B in Instruct and Thinking variants under Apache‑2.0, touting a Visual Agent that operates GUIs, Visual Coding (screenshot‑to‑HTML/CSS/JS), 256k context scalable to 1M, and 32‑language OCR. The Instruct model targets perception tasks while Thinking boosts STEM/causal reasoning. release thread Hugging Face collection Qwen blog post
- Benchmarks: internal tables show strong wins vs Gemini 2.5 Pro on several vision perception suites; long‑video (2‑hour) and multi‑page PDF support emphasized. benchmarks chart
- Capabilities: advanced 2D/3D grounding, occlusion handling, relative coordinates; Visual Agent scores SOTA on OS World per Qwen. release thread
- Access: weights on HF/ModelScope; chat/API endpoints on Qwen Chat and Alibaba Cloud Model Studio. Model Studio Qwen Chat
Qwen3‑Coder‑Plus upgrade hits 69.6 on SWE‑Bench Verified and boosts TerminalBench
Qwen3‑Coder‑Plus (2025‑09‑23) ships with stronger terminal task handling and safer code generation, posting 69.6 on SWE‑Bench Verified and large gains on TerminalBench under both Qwen Code and Claude Code scaffolds. model update engineer note
- TerminalBench: 40.5 with Qwen Code (↑ from 37.5), 37.5 with Claude Code (↑ from 25.0). model update
- Security posture improved per Qwen; Qwen Code adds multimodal input and sub‑agent support. model update
- Availability: new API id and integration across Qwen Code and partner sandboxes noted by community. engineer note
Qwen3Guard safety series launches (Gen + Stream) with 3‑tier severity and 119‑language coverage
Qwen3Guard debuts as two complementary moderation models: Gen (full‑context analysis) and Stream (per‑token, low‑latency interception) with a 3‑tier scheme (Safe/Controversial/Unsafe), delivered in 0.6B/4B/8B sizes under Apache‑2.0. product overview performance bars tech report
- Coverage: 119 languages; Stream flags unsafe tokens as they’re generated to preempt harmful output. product overview
- Training: tech report describes 1.19M prompt/response samples and “strict vs loose” labeling to derive the Controversial class, plus label distillation to de‑noise. tech report
- Early field notes: some users show leetspeak/academic framing bypasses on Gen—worth monitoring as patches land. bypass demo
Qwen3‑LiveTranslate‑Flash rolls out as a real‑time multimodal interpreter with strong EN↔ZH BLEU
Qwen introduced Qwen3‑LiveTranslate‑Flash for on‑device/streaming interpretation that understands speech, lip motion, and gestures, speaking 10 languages and understanding 18. Charts show top BLEU on EN→ZH and competitive performance across EN↔XX and ZH↔XX vs GPT‑4o/Gemini. model tweet benchmarks chart
- BLEU: EN→ZH 45.3 (lead in chart); strong averages across EN↔XX and XX↔ZH. benchmarks chart
- Use cases: live translation with visual cues (lip/gesture), multilingual speech I/O. model tweet
- Deployment: part of Qwen3 real‑time stack for interpreters and voice agents. model tweet
Meituan’s LongCat‑Flash‑Thinking (560B MoE; 27B active) targets theorem proving, coding and tool use
Meituan unveiled LongCat‑Flash‑Thinking, a 560B MoE with ~27B active parameters, dual‑path agentic reasoning and tool calling via <longcat_tool_call> tags, claiming SOTA slices such as 81.6 Pass@32 on MiniF2F and strong agent/coding benchmarks. model thread
- Domains: math/logic, theorem proving, coding, search+exec; integrates with vLLM/SGLang for deployment. model thread
- Packaging: described as MIT‑licensed with open weights on HF (per announcement). model thread
- Competitive set: results charted against GPT‑5, o3/o4‑mini, DeepSeek, Gemini, Qwen across tool and code suites. model thread
🎬 Video gen with sound and HDR; pipelines go day‑0
Kling 2.5 and Wan 2.5 arrive on fal with day‑0 access; ComfyUI ships Wan 2.5 preview nodes; Luma Ray3 adds 16‑bit HDR and iterative CoT; OmniHuman 1.5 lands. Excludes coding model news (feature).
Kling 2.5 lands on fal with day‑0 access and clear pricing
fal switched on Kling 2.5 for both text‑to‑video and image‑to‑video on day one, highlighting big jumps in dynamics, composition, style and emotion handling. Pricing is straightforward: $0.35 for 5 seconds, then $0.07 per extra second via API/playground. See the launch details in launch post and links to fal’s model pages for text to video and image to video, plus more examples in the blog post.
- Features called out: “dramatically improved dynamic quality,” “breakthrough composition aesthetics,” enhanced style adaptation (incl. anime), stronger subject response and emotion capture launch post.
- Try flows directly on fal with day‑0 access (no waitlist) for both T2V and I2V product links.
- Screenshots and early user confirmations that it’s live ahead of other platforms early access note.
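As a rough illustration of the day‑0 API and the pricing, here is a hedged sketch with fal’s Python client; the endpoint slug and argument names are assumptions to verify on the model page, and the cost helper simply encodes the $0.35 + $0.07/s schedule quoted above.

```python
# Illustrative sketch with the fal Python client (pip install fal-client).
# The endpoint slug and argument names are assumptions; the cost helper
# reflects the quoted pricing: $0.35 for the first 5 s, then $0.07 per extra second.
import fal_client

def kling_cost(duration_s: int) -> float:
    return 0.35 + max(0, duration_s - 5) * 0.07

result = fal_client.subscribe(
    "fal-ai/kling-video/v2.5-turbo/pro/text-to-video",  # assumed slug; check fal's model page
    arguments={"prompt": "A slow dolly shot through a rain-soaked neon market at night"},
)
print(result)                                                    # response shape varies by model
print(f"estimated cost for a 10 s clip: ${kling_cost(10):.2f}")  # -> $0.70
```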
Wan 2.5 goes live on fal with native audio for T2V and I2V
fal added Wan 2.5 with sound on day 0, exposing both text‑to‑video and image‑to‑video endpoints so you can ship clips with synchronized audio immediately availability note. Links point to the new 720p endpoints (example output shown on the model pages) image to video page, text to video page.
- Adds sound generation to video (dialogue, ambient, effects) so you don’t need a separate audio pipeline availability note.
- T2V and I2V playground/API pages are up for immediate testing at fal image to video page, text to video page.
- Complements ComfyUI’s preview nodes (see the separate topic) for workflow builders.
ComfyUI ships Wan 2.5 preview API nodes: 10s 1080p + audio conditioning
ComfyUI rolled out preview API nodes for Wan 2.5 with audio‑visual sync (voices, ASMR, music, SFX), 10‑second videos, 1080p quality, and audio conditioning as an input. Update ComfyUI to 0.3.60 and search for “Wan Text/Img to Video” in API nodes release thread, how to update.
- Example posts show music‑backed videos, character voices, and sound‑effect control to test the AV sync path end‑to‑end music demo, character voice, sound effects.
- Full blog explains node setup, new instruction‑following, and higher fidelity 1080p 24fps outputs how to update.
- Adds a no‑code path to evaluate Wan 2.5’s audio stack before broader API maturity.
Luma Ray3 debuts 16‑bit HDR video and iterative self‑critique generation
Luma’s Ray3 introduces first‑class 16‑bit HDR support and a chain‑of‑thought‑style iterative loop that analyzes intermediate generations for prompt adherence before finalizing outputs. Artificial Analysis is adding Ray3 to their arena for comparisons model overview, arena reference.
- HDR unlocks richer highlights/shadows; Ray3 can also convert SDR sources to HDR model overview.
- The iterative “think then improve” loop aims to reduce prompt drift on both text‑to‑video and image‑to‑video model overview.
- API access isn’t available yet; use Luma Dream Machine, with third‑party evaluations coming soon arena reference.
Day‑0 video stack roundup: Kling 2.5, Wan 2.5 (sound), and ComfyUI nodes align
Production pipelines get easier as Kling 2.5 arrives on fal, Wan 2.5 launches with sound on fal, and ComfyUI ships preview Wan 2.5 nodes for AV sync and 1080p. Together these cover T2V/I2V, native audio, and no‑code composition for rapid iteration kling launch, wan availability, comfyui release.
- Kling pricing: $0.35 per 5s then $0.07/s; API and playground pages for both modes text to video, image to video.
- Wan 2.5 sound support lands on fal alongside ComfyUI nodes with audio conditioning and 10‑second flows t2v page, blog post.
- Teams can mix‑and‑match: prompt in fal for speed, block in ComfyUI for control, and compare with Luma Ray3 HDR in evals arena reference.
Higgsfield offers unlimited Kling 2.5 and a “Reasoning Engine” for directed shots
Higgsfield announced day‑0 unlimited access to Kling 2.5 inside its platform and pitched a Reasoning Engine that steers shot intent (camera moves, lighting, acting) so outputs feel directed rather than purely generated platform update, feature thread.
- Creator walkthroughs show style versatility from anime to cinematic realism and smarter subject recognition style examples.
- The flow: pick Kling 2.5, drop an image, add a prompt; the system handles the rest usage steps.
- Promo running through Sep 30 with “unlimited” plans; partner links provided for trials offer link.
Qwen Image Edit Plus expands to fal and Replicate with ControlNet and fast turns
Edits now run in ~6 seconds on Replicate and ship with multi‑image composition, better identity retention, and native ControlNet (depth/edges/keypoints). This follows the core release we covered yesterday initial release, adding broad distribution plus clear pricing and playgrounds replicate page, fal launch.
- Multi‑image blends like person+product+scene, plus strong text/font edits and pose‑consistent faces fal launch.
- fal playground/API is live with commercial use and $0.08 per megapixel pricing fal playground.
- Replicate collaboration cites ~6s edit latency via Pruna‑optimized serving replicate page.
OmniHuman 1.5 hits fal day‑0: image+audio→video with tight emotion/motion sync
fal enabled day‑0 access to ByteDance’s OmniHuman v1.5 so you can drive character videos from a single image and an audio track. It promises crisper lip‑sync, expressive motion, and support for cartoons/non‑human characters launch post.
- Works from a still image plus audio; supports varied character types beyond humans launch post.
- Live playground/API with cost shown ($0.16 per second) for budget planning model page.
- Demo links provided for quick trials across formats demo links.
📊 Real‑world evals: compile, creative, long‑context
Mixed eval drops: CompileBench under dependency hell, creative writing deltas, and long‑context retrieval boards; ARE/Gaia2 chatter continues but was covered prior. Excludes GPT‑5‑Codex launch (feature).
CompileBench now shows GPT‑5‑Codex‑High and Claude‑Sonnet‑4‑Thinking tied at 91% pass@1 (93% pass@3) on messy real‑world builds
New results on CompileBench put GPT‑5‑Codex‑High and Claude‑Sonnet‑4‑Thinking‑16k neck‑and‑neck at the top (91% pass@1, 93% pass@3), across end‑to‑end OSS compiles under dependency hell—following up on launch leaderboard which introduced the suite and tasks. See the updated ranking and task mix under cross‑compile and legacy resurrection conditions. ranking chart
- Tied leaders: Claude‑Sonnet‑4‑Thinking‑16k and GPT‑5‑Codex‑High at 91/93, with GPT‑5‑High at 87/93 and Claude‑Opus‑4.1‑Thinking at 80/100 close behind. ranking chart
- Benchmark scope stresses “works in the real world”: toolchains, patching, flag hunts, Windows/ARM64 cross‑compiles, and 22‑year‑old code resurrections. benchmark note, and benchmark post
- Practical takeaway for agents: compile‑time scaffolds, package resolution, and CI‑style retries matter as much as raw model IQ in closing the last‑mile gap. benchmark note
Context Arena adds Grok‑4‑Fast (Thinking): strong ≤128k, steep drop beyond; placed #6–#15 across 2/4/8‑needle boards
Long‑context retrieval boards were updated with xAI’s Grok‑4‑Fast (Thinking). It posts competitive scores up to 128k tokens but degrades quickly at 256k+ and 1M, suggesting practical use ≤128k despite a 2M window. leaderboard update
- Placements: #15 on 2‑needle @128k AUC, #7 on 4‑needle @128k, #6 on 8‑needle @128k; rapid falloff beyond 128k highlighted in the per‑bin bars. leaderboard update
- Guidance: avoid relying on nominal max context—segment or summarize above 128k to preserve retrieval fidelity. leaderboard update
- Full results and AUC curves are available on the public board. arena site
Creative Writing Benchmark: Grok‑4‑Fast edges Grok‑4 but trails leaders; detailed failure modes shared
An updated run on the LLM Creative Writing Benchmark shows Grok‑4‑Fast performing slightly better than Grok‑4, while still well below the top cluster. The evaluator thread breaks down strengths (clear orientation, image‑led closure) and weaknesses (conceptual endings without paid cost, plateaued escalation). benchmark chart
- Executive profile notes: strong setting/POV control; weaker on dramatized sacrifice and late‑scene escalation. executive profile
- Suggested fixes: make cost visible at closure, add mid‑scene narrowing, prune abstract labels at climax. executive profile
- Benchmark and rubric available for reproducible evals across story elements. benchmark repo
LMSYS Text Arena adds DeepSeek‑V3.1‑Terminus; community reports cleaner language consistency vs V3.1
DeepSeek‑V3.1‑Terminus and its thinking variant are now live in the LMSYS Text Arena for head‑to‑head battles. Early testers highlight improved CN/EN mixing and more stable tool‑use behavior relative to V3.1. arena addition
- Availability: both chat and reasoning modes are listed for blind‑pair comparisons against top systems. try here
- Vendor notes cite strengthened Code/Search Agents and reduced odd characters; API/news page includes context length and pricing details. DeepSeek news page
- Useful for practitioners validating real‑world dialog stability beyond static benchmark scores. arena addition
🧩 Interop: Remote MCP, OAuth, browser actions
Tooling to wire agents to apps/data: GroqCloud remote MCP beta, Warp adds OAuth for Figma MCP, and Edge Copilot tests "Browser Actions". Excludes GPT‑5‑Codex news (feature).
GroqCloud launches remote MCP (beta) to wire agents to external tools over an OpenAI‑compatible API
Groq turned on remote MCP support in GroqCloud (beta), letting OpenAI‑style agents connect to GitHub, browsers, databases and more via the Groq Responses API, with an emphasis on faster runs at lower cost. The beta mirrors OpenAI’s remote MCP spec so existing scaffolds can switch with minimal code changes. launch thread Groq blog
- OpenAI‑compatible Responses API plus remote MCP, so tools like BrowserBase, Firecrawl, Exa and Hugging Face can be called from hosted models. Groq blog
- Cookbook demos show end‑to‑end patterns (tool approval, web agents, search+scrape loops) for quick adoption. details post
- Supported at launch across multiple hosted models (e.g., GPT‑OSS‑120B/20B, Llama 4 variants, Kimi K2, Qwen 3) with drop‑in migration promised. Groq blog
- Positioning is interop and economics: agent actions execute on GroqCloud to reduce latency/expense while remaining API‑compatible. launch thread
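Since the beta mirrors OpenAI’s remote MCP spec over an OpenAI‑compatible surface, a drop‑in client should look roughly like the sketch below; the model id, the MCP server URL, and Responses API availability on a given account are assumptions, and the tool dict follows OpenAI’s published remote‑MCP shape rather than Groq‑specific docs.

```python
# Hedged sketch of a remote MCP call against GroqCloud's OpenAI-compatible
# Responses API. Model id, server_url, and /responses availability are
# assumptions; the tool dict follows OpenAI's remote MCP format, which the
# beta is said to mirror.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1",
)

resp = client.responses.create(
    model="openai/gpt-oss-120b",  # one of the hosted models named in the post
    input="Find the most downloaded sentence-embedding model and summarize its model card.",
    tools=[{
        "type": "mcp",
        "server_label": "huggingface",
        "server_url": "https://huggingface.co/mcp",  # assumed remote MCP endpoint
        "require_approval": "never",
    }],
)
print(resp.output_text)
```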
Microsoft tests Copilot “Browser Actions” in Edge to automate site navigation with profile context
Edge is surfacing a hidden Copilot setting that lets the assistant browse and complete tasks using your Edge profile info—hinting at first‑party agentic browsing (form fills, navigation, site actions) similar to Perplexity’s Comet and Chrome+Gemini experiments. feature write‑up Testingcatalog post
- The control lives under Copilot privacy settings; no public rollout timeline is noted yet. feature write‑up
- Early language implies access to cookies/profile‑scoped data for higher success on authenticated flows (with attendant security/privacy questions). full scoop
- If shipped broadly, Edge becomes a native agent runtime—reducing reliance on third‑party browser automation for many user workflows. feature write‑up
Warp adds OAuth for MCP and documents Figma’s remote MCP integration
Warp shipped OAuth support for MCP and highlighted a new remote MCP server from Figma, tightening the loop between design artifacts and terminal‑based agents. Teams can now register SSE/CLI MCP servers in Warp, manage startup behavior, and route secure tool calls with OAuth flows. feature note docs update
- MCP servers can be added via Warp Drive with JSON config (CLI or SSE) and started/stopped per workspace; logs aid debugging. Warp docs
- The Figma remote MCP server bridges design data and agent tools, enabling agent actions against design systems with user consent. feature note
- OAuth reduces key exposure for shared agent setups, especially helpful in multi‑tool, multi‑org terminals. feature note
OpenRouter debuts OpenAI‑compatible Responses API (alpha) with reasoning, tools and web search
OpenRouter released an alpha Responses API that mirrors OpenAI’s surface while adding provider‑agnostic routing, tool calling (parallel), and optional web search—all via a stateless endpoint that accepts full conversation history. Early users can earn a $10 credit with feedback. alpha announcement docs link credit offer
- Reasoning knobs and tool schemas are supported; the alpha may introduce breaking changes as features stabilize. alpha announcement
- Designed for easy drop‑in with OpenAI SDKs while enabling multi‑model backends and portability. Responses API docs
- Encourages consolidation of agent loops (reasoning, tools, search) behind one interop layer to avoid hard provider lock‑in. alpha announcement
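The drop‑in pattern is the usual one: the same OpenAI SDK pointed at OpenRouter’s base URL. The chat‑completions surface is the stable path today; the alpha Responses surface mirrors OpenAI’s responses.create but may still change, so the sketch below sticks to the stable route and treats the model slug as an assumption to verify on the model page.

```python
# Sketch of OpenRouter's OpenAI-compatible path. The model slug is an
# assumption; the Responses alpha is not shown because its surface may change.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_OPENROUTER_API_KEY",
    base_url="https://openrouter.ai/api/v1",
)

chat = client.chat.completions.create(
    model="openai/gpt-5-codex",  # assumed OpenRouter slug
    messages=[{"role": "user", "content": "Write a property-based test for a URL parser."}],
)
print(chat.choices[0].message.content)
```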
🧪 Decoding, scheduling and low‑latency stacks
Inference/runtime advances today: step‑level Lookahead Reasoning for SD, Ollama’s new model scheduler, and vLLM cudagraphs. Excludes MCP/agent orchestration items.
Lookahead Reasoning stacks step‑level speculation on SD, hitting up to 2.11× speedups
A new NeurIPS’25 method, Lookahead Reasoning (LR), adds step‑level speculation on top of token‑level Speculative Decoding (SD) to speed up reasoning models while preserving accuracy release thread.
- Draft model proposes future reasoning steps; the target model regenerates each step in parallel and a verifier checks semantic correctness (not token identity) method diagram, project blog.
- Stand‑alone LR delivers 1.04×–1.71× speedups with minimal accuracy change (+1.0/−2.1%), and combining LR+SD lifts speedups to 2.11× on GSM8K/AIME performance plot, ArXiv paper.
- Benefits scale with higher GPU throughput (H100/H200/B200/Rubin), overcoming SD’s long‑draft ceiling by operating at reasoning‑step granularity bottleneck note.
- Code is available for immediate testing and integration with existing speculative decoders GitHub repo.
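The control flow is easy to picture; below is a minimal sketch of the step‑level loop, with plain callables standing in for the draft model, target model, and semantic verifier. It illustrates the idea (speculate steps, regenerate them in parallel, accept the longest semantically matching prefix), not the authors’ implementation.

```python
# Sketch of step-level speculation: a cheap draft model proposes future reasoning
# steps, the target model regenerates each position in parallel, and a verifier
# accepts steps that match in meaning rather than token-for-token.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def lookahead_reasoning(
    problem: str,
    draft_step: Callable[[str, List[str]], str],   # draft model: next step given the trace
    target_step: Callable[[str, List[str]], str],  # target model: same, but slower/stronger
    same_meaning: Callable[[str, str], bool],      # semantic verifier
    max_steps: int = 16,
    lookahead: int = 4,
) -> List[str]:
    steps: List[str] = []
    while len(steps) < max_steps:
        # 1) Draft model speculates `lookahead` future steps sequentially (cheap).
        drafts, ctx = [], list(steps)
        for _ in range(lookahead):
            nxt = draft_step(problem, ctx)
            drafts.append(nxt)
            ctx.append(nxt)
        # 2) Target model regenerates every speculated position in parallel,
        #    each conditioned on the draft prefix before it.
        with ThreadPoolExecutor() as pool:
            targets = list(pool.map(lambda i: target_step(problem, steps + drafts[:i]),
                                    range(lookahead)))
        # 3) Accept the longest prefix where draft and target agree in meaning;
        #    on the first mismatch, keep the target's step and re-speculate.
        for d, t in zip(drafts, targets):
            if same_meaning(d, t):
                steps.append(d)
            else:
                steps.append(t)
                break
    return steps  # a real loop would also stop once a final answer is emitted
```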
Ollama ships smarter model scheduler: fewer OOMs, better multi‑GPU utilization
Ollama introduced a new model scheduling engine that reduces out‑of‑memory crashes, improves multi‑GPU performance, and reports memory usage accurately—aimed at faster, more reliable local inference release thread.
- Accurate per‑model memory accounting enables tighter packing and higher prompt/token throughput, especially for long‑context and image‑input models blog post.
- Multi‑GPU scheduling boosts parallelism while avoiding thrash; users can update today for Mac/Win/Linux download.
- The scheduler is on by default for popular models (e.g., gpt‑oss, llama4, gemma3, qwen3) with more to follow, per release notes blog post.
vLLM turns on full CUDA graphs by default to accelerate low‑latency serving
vLLM enabled full CUDA graphs by default, reducing kernel launch overheads and boosting throughput for short‑prompt, low‑latency inference workloads maintainer note.
- Expect noticeable latency gains for interactive use cases (chat, autocomplete) where dispatch overhead dominates.
- Change lands upstream; no code changes required for most deployments beyond upgrading to the latest vLLM build maintainer note.
- Complements other runtime wins (e.g., batch‑invariant sampling) reported recently across the inference stack.
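For most users nothing changes beyond upgrading; the long‑standing enforce_eager flag remains the coarse opt‑out if graph capture misbehaves in a particular setup. The model name below is just an example.

```python
# Upgrading picks up the new default; enforce_eager=True disables CUDA graph
# capture entirely if it causes problems in a given environment.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # full CUDA graphs by default on recent builds
# llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True)  # coarse opt-out

outputs = llm.generate(["Summarize CUDA graphs in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```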
Amp Tab rolls out by default: fast, cross‑file completions driven by diagnostics
Sourcegraph’s Amp Tab is now on by default for new installs, offering a free, instant completion engine that propagates edits across files based on recent changes and compiler errors—no agent loop required feature brief.
- Uses a custom model that understands your current diff, diagnostics, and semantic context to suggest single‑ or multi‑line edits where they’re needed completion behavior.
- Cross‑file updates: change a call signature or interface and Tab through suggested fixes across the codebase without hunting references cross‑file fixes.
- Works in VS Code and compatible editors; existing users can enable via the command palette (“Enable Amp Tab”) usage tip, product page.
🗣️ Live voice agents and proactive mobile stacks
Gemini Live API ‘native audio’ gets more reliable function calling, interruptions and affective dialog; Flowith 2.0 ships proactive voice agent on iOS; hints of ChatGPT video Q&A. Excludes media T2V (covered elsewhere).
Gemini Live adds native audio with 2× more reliable function calls
Google upgraded the Gemini Live API with a native‑audio model that doubles single‑call function‑calling reliability and makes voice agents feel more human via better interruption handling, barge‑in, and affective dialog. The preview model ships as gemini‑2.5‑flash‑native‑audio‑preview‑09‑2025. update card, product note, docs thread
- Native audio vs half‑cascade: direct speech‑in/speech‑out improves timing and tone; a fallback “half‑cascade” uses TTS for more predictable production flows. See modes in Live API docs.
- Conversation quality: higher accuracy for pausing/resuming, better side‑chatter filtering, and more natural VAD/barge‑in behavior. docs thread
- Tool use: sturdier function calling (2× reliability in single‑call tests) and planned “thinking” budget to gate deeper reasoning on harder turns. docs thread
- Connectivity and auth: client WebSocket streaming with ephemeral tokens for mobile/browser; server‑to‑server supported when backend control is needed. deep dive thread
- Try it now via AI Studio Live; model IDs and sample applets are linked from the docs. AI Studio live, function calling applet
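A hedged connection sketch with the google-genai Python SDK follows; the Live surface has shifted across SDK versions, so treat the exact method names and the tool‑declaration format as assumptions and defer to the Live API docs linked above.

```python
# Hedged sketch: connecting to the native-audio preview model and registering a
# function declaration to exercise the improved tool calling. Method names and
# config fields are assumptions to verify against the current google-genai SDK.
import asyncio
from google import genai

MODEL = "gemini-2.5-flash-native-audio-preview-09-2025"

async def main():
    client = genai.Client()  # reads the Gemini API key from the environment
    config = {
        "response_modalities": ["AUDIO"],
        "tools": [{"function_declarations": [{
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {"type": "OBJECT",
                           "properties": {"city": {"type": "STRING"}}},
        }]}],
    }
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "What's the weather in Oslo?"}]},
            turn_complete=True,
        )
        async for msg in session.receive():
            if msg.tool_call:            # the model asked to call get_weather
                print(msg.tool_call)
            if msg.data:                 # raw audio bytes of the spoken reply
                print(f"received {len(msg.data)} audio bytes")

asyncio.run(main())
```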
Flowith 2.0 ships proactive voice agent on iOS with Knowledge Garden and Dozer
A new Flowith release brings a proactive voice agent to iPhone and iPad that “listens, then acts” based on conversational context, tied to a revamped liquid‑glass UI, a graph‑like Knowledge Garden, and faster streaming. launch thread, feature post
- Proactive agent: acts before taps by grounding in recent context, and routes across chat, images, and video with one subscription. launch thread
- Dozer transcription: real‑time voice capture with long recordings, speaker diarization, and direct Knowledge Garden integration. feature brief, review thread
- Workflow polish: multi‑message quoting, sharing conversations as links, drag‑and‑drop files, and character‑level streaming for snappier UX. feature brief
- Availability: iOS/iPadOS live now; App Store listing details supported devices and privacy posture. App Store page
Strings hint ChatGPT mobile will add short‑video capture with voice Q&A
Not a launch announcement, but localized strings in the ChatGPT app suggest a feature to “capture videos and ask your question out loud for a faster answer,” pointing to real‑world, multimodal troubleshooting on device. localization leak, mobile strings hint
- Likely pipeline: frame sampling + on‑device vision (objects/text/motion) + speech‑to‑text, fused into a multimodal encoder for grounded answers. feature analysis
- Use cases: point‑and‑ask for devices, whiteboards, forms; faster than typing long descriptions while preserving visual context. feature analysis
- Caveat: no rollout timing disclosed; treat as an early signal from app resources rather than a confirmed release. mobile strings hint
🧠 Hierarchical retrieval and agentic search stacks
Fresh retrieval research and stacks: a long‑distance recall recipe for hierarchies and production agentic retrieval engines. MetaEmbed was covered yesterday; scope here is narrowed to today’s new drops.
DeepMind’s hierarchical retrieval recipe lifts far‑ancestor recall to 76% with tiny embeddings
A pretrain→finetune dual‑encoder schedule fixes the “lost‑in‑the‑long‑distance” problem: WordNet long‑distance recall rises from 19% to 76% at just 16 dims, while overall recall climbs at higher dims without breaking near matches wordnet results, and the approach generalizes beyond synthetic trees paper explainer.
- Asymmetric query/doc embeddings + softmax loss (τ≈20) model one‑way relevance (ancestor is relevant to child, not vice‑versa) training loss.
- Two‑stage schedule: pretrain on regular pairs, then finetune on far pairs with ~1000× lower LR and high temp (≈500) and early stopping to protect near slices finetune recipe.
- At 64 dims: overall recall reaches ~92.3%, worst‑distance slice ~75.7%, showing strong long‑range gains without harming close ancestors wordnet results.
- Practical tip: any proxy that separates near vs far pairs works for the finetune; no exact tree distances needed finetune recipe.
- Full details and proofs on dimension scaling with depth and log‑size in the paper ArXiv paper.
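To make the recipe concrete, here is a small PyTorch sketch of the asymmetric dual encoder and in‑batch softmax loss; the tower architecture, how τ is applied, and the optimizer settings are simplifications for illustration, not the paper’s code.

```python
# Asymmetric dual encoders (separate query/doc weights) with an in-batch softmax
# loss over (child query -> ancestor document) pairs, plus the two-stage schedule:
# pretrain on regular pairs, then finetune on far pairs at a ~1000x lower LR.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 16):   # tiny dims, echoing the 16-d result
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)
    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.emb(token_ids), dim=-1)

query_tower, doc_tower = Tower(30_000), Tower(30_000)      # asymmetry: two separate towers
params = list(query_tower.parameters()) + list(doc_tower.parameters())

def softmax_retrieval_loss(q_ids, d_ids, tau: float = 20.0) -> torch.Tensor:
    # One-way relevance: row i's positive is document i (its ancestor); every
    # other in-batch document is a negative. How tau enters is a paper detail;
    # here it simply scales the similarities.
    q, d = query_tower(q_ids), doc_tower(d_ids)
    logits = tau * (q @ d.T)
    return F.cross_entropy(logits, torch.arange(q.size(0)))

opt_pretrain = torch.optim.Adam(params, lr=1e-3)   # stage 1: regular (mostly near) pairs
opt_finetune = torch.optim.Adam(params, lr=1e-6)   # stage 2: far pairs only, ~1000x lower LR,
                                                   # with early stopping to protect near slices
```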
Shopee’s OnePiece lifts GMV per user by 2% by bringing context engineering and reasoning to ranking
Deployed in Shopee’s personalized search, OnePiece integrates structured context engineering, block‑wise latent reasoning and progressive multi‑task training—delivering +2.0% GMV per user and +2.9% ad revenue online paper page.
- Structured context engineering enriches inputs with user history and preference signals under a schema paper page.
- Block‑wise latent reasoning refines representations step‑by‑step, improving retrieval + ranking robustness.
- Progressive multi‑task training leverages user feedback to supervise intermediate reasoning states, not just final clicks ArXiv paper.
Perceptron debuts TensorStream to schedule interleaved multimodal events for omni models
Perceptron open‑sourced TensorStream, a tensor‑like interface that organizes video, audio, text and metadata into a single prioritized event stream for training/inference with omni models. It tackles token accounting and timing by separating data from descriptors and tracking dims_real/virtual and measurements project blog, Perceptron blog.
- Event scheduling with per‑modality priorities (randomized at load) yields diverse tasks from the same sample while keeping a unified stream.
- Clean token accounting: preprocessors update measurements so downstream components can budget context precisely.
- Built to run both training and realtime inference on the same abstraction, easing agent pipelines that juggle vision, audio and text.
ARK‑V1: lightweight KG agent hits ~70–74% overall on long‑tail QA with multi‑hop traversal
A simple knowledge‑graph agent that loops: pick anchor → choose relation → fetch triples → write a reasoning step—achieves ~70–74% overall with 94%+ conditional accuracy on long‑tail QA using larger backbones (e.g., GPT‑5 Mini, Gemini 2.5 Flash) and ~70% with mid‑scale models performance results, paper notes, ArXiv paper.
- Especially effective on rare entities where parametric memory fails; traversal grounds answers in explicit KG facts paper notes.
- Deterministic loop produces step‑by‑step natural‑language inferences after multi‑hop traversal, then summarizes a final answer how it works.
- Weak spots: ambiguity, conflicting triples, or missing commonsense—agent can over‑trust the KG; future work targets smarter prompting/efficiency limitations.
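The deterministic loop is simple enough to sketch; the version below uses callables as stand‑ins for the backbone LLM and the KG store, and the stopping convention is illustrative rather than ARK‑V1’s actual prompt format.

```python
# Control-flow sketch of the traversal loop: pick anchor -> choose relation ->
# fetch triples -> write a reasoning step, then summarize a final answer.
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

def kg_agent_answer(
    question: str,
    pick_anchor: Callable[[str, List[str]], str],               # LLM: entity to expand next
    choose_relation: Callable[[str, str, List[str]], str],      # LLM: relation to follow
    fetch_triples: Callable[[str, str], List[Triple]],          # KG lookup
    write_step: Callable[[str, List[Triple], List[str]], str],  # LLM: natural-language inference
    summarize: Callable[[str, List[str]], str],                 # LLM: final answer from the trace
    max_hops: int = 4,
) -> str:
    trace: List[str] = []
    for _ in range(max_hops):
        anchor = pick_anchor(question, trace)
        relation = choose_relation(question, anchor, trace)
        triples = fetch_triples(anchor, relation)
        if not triples:                        # dead end in the KG: stop traversing
            break
        step = write_step(question, triples, trace)
        trace.append(step)
        if step.strip().endswith("[DONE]"):    # illustrative stop signal from the LLM
            break
    return summarize(question, trace)
```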
Agentic retrieval meetup surfaces Hornet patterns for iterative, schema‑first search loops
At an SF panel, teams compared old keyword stacks vs agent‑first retrieval, highlighting schema‑first APIs, predictable query plans, and token‑aware loops near your data—building on yesterday’s Hornet debut initial debut—with a call to tune BM25+hybrid baselines for serving readiness event photo, Hornet for agentic retrieval.
- Iterative and parallel retrieval loops benefit from explicit schemas (less tool‑fail and fewer wasted tokens) event photo.
- Keep the engine by your agents/data for latency and cost control; reserve LLM calls for synthesis, not brute‑force recall.
- Evals: measure recall under long, structured queries and failure recovery, not just single‑shot similarity.
🛡️ Guardrails and governance: streaming moderation & red lines
New safety models and governance calls: Qwen3Guard moderation (streaming + 3‑tier severity), UN‑linked ‘AI Red Lines’ letter; practical injection concerns raised. Excludes Responses API safety notes from feature.
Qwen3Guard ships streaming, multilingual AI safety guardrails under Apache-2.0
Alibaba’s Qwen team launched Qwen3Guard in 0.6B/4B/8B sizes across two variants—Gen (full‑context) and Stream (per‑token)—supporting 119 languages and a three‑tier verdict: Safe, Controversial, Unsafe model release.
- Stream does live, token‑level checks to halt unsafe output as it’s generated; Gen performs full‑context post checks with richer analysis model release.
- Bench snapshots show strong prompt/response classification across English, Chinese and multilingual sets (Qwen3Guard‑Gen‑8B leading most bars) performance chart.
- Open‑sourced under Apache‑2.0 with both per‑token and full‑context architectures documented, including training mix and labeling strategy release note, tech report.
- Training and data curation emphasize multilingual safety and gray‑area handling (Controversial tier), aiming to reduce over‑blocking while catching borderline content stream explainer, build details.
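Usage should look like any generative classifier on Hugging Face; the sketch below assumes the Gen variant with transformers, and the repo id plus the exact verdict format are assumptions to check against the model card.

```python
# Hedged sketch: moderating a prompt with the Gen (full-context) variant via
# transformers. Repo id and output format are assumptions; follow the model
# card's chat/moderation template for production use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3Guard-Gen-4B"  # assumed repo id; 0.6B and 8B sizes also announced
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How do I pick the lock on my neighbor's door?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=64)
verdict = tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
print(verdict)  # expected to name one of the tiers: Safe / Controversial / Unsafe
```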
Early tests show simple bypasses for Qwen3Guard’s filters
Despite the launch, practitioners quickly surfaced cases where leetspeak and “academic context” framing slipped past safeguards, underscoring the hard trade‑offs between recall and precision in moderation bypass example.
- Screenshots show prompts like “how to make m3th… academic” and “ACADEMIC CONTEXT… r1c1n” being marked Safe, suggesting gaps in obfuscation defenses bypass example.
- Streaming guards can mitigate by intervening token‑by‑token, but policy lines (Safe vs Controversial) still need continual tuning stream explainer.
- The Qwen3Guard design explicitly models a middle “Controversial” tier via strict/loose label disagreement to capture gray areas—useful, but also a potential source of leniency if mis‑tuned controversial labeling, model release.
Email agents raise new exfiltration vectors as Perplexity assistant rolls out
Following up on launch, developers are flagging prompt‑injection and data‑exfiltration risks for autonomous email flows (e.g., calendar invites with embedded sensitive context), as the Perplexity assistant now drafts replies, schedules and prioritizes threads for Max users feature brief, security question.
- The assistant can read and act across Gmail/Outlook; without robust tool gating and content filtering, crafted emails could induce unsafe actions or leak private data feature brief.
- Mitigations to consider: allow‑lists for actions/recipients, red‑team test corpora with injection strings, human‑in‑the‑loop on high‑risk changes, and structured render‑safe templates for generated invites security question.
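None of the posts prescribe an implementation, but the allow‑list idea reduces to a small gate in code; the sketch below is purely illustrative, with hypothetical action names and a placeholder org domain.

```python
# Illustrative mitigation: gate every proposed agent action behind an
# action/recipient allow-list and escalate anything else to a human.
from dataclasses import dataclass
from typing import Optional

ALLOWED_ACTIONS = {"draft_reply", "label_thread"}     # hypothetical; never auto-send/forward
ALLOWED_RECIPIENT_DOMAINS = {"example.com"}           # placeholder org domain

@dataclass
class ProposedAction:
    name: str
    recipient: Optional[str] = None

def gate(action: ProposedAction) -> str:
    if action.name not in ALLOWED_ACTIONS:
        return "escalate_to_human"
    if action.recipient and action.recipient.split("@")[-1] not in ALLOWED_RECIPIENT_DOMAINS:
        return "escalate_to_human"
    return "allow"

print(gate(ProposedAction("send_email", "attacker@evil.test")))     # -> escalate_to_human
print(gate(ProposedAction("draft_reply", "teammate@example.com")))  # -> allow
```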
Prompt‑injection remains trivial via social platforms
Practitioners note it’s “delightful how easy” to mount injections via LinkedIn and DMs—another reminder that agentic systems must treat external text as hostile unless verified LinkedIn demo, DM jailbreak.
- Social surfaces combine high trust cues with arbitrary text, raising the odds that template‑based or lightly‑guarded agents will execute attacker intent LinkedIn demo.
- Embed‑side mitigations: strict tool allow‑lists, render‑safe markdown/HTML, instruction firewalls, and provenance tags; runtime mitigations: multi‑model checks and adjudication for risky intents DM jailbreak.
💼 Enterprise adoption, pricing and ops updates
Enterprise‑relevant moves: The Washington Post on Together Dedicated Endpoints, Microsoft UK AI plan, productized email agent rollouts. Excludes infra megaprojects (covered separately).
Microsoft sets $30B UK plan through 2028, including 23k+ GPU supercomputer and AI skills push
Microsoft will invest $30B in the UK by 2028, allocating $15B to AI infrastructure and planning an Nscale supercomputer with 23,000+ GPUs, alongside datacenter expansion, support for 6,000 staff, and training for 1M+ people in AI skills. UK plan brief
- The infrastructure build and workforce programs signal more regional supply for training/inference and a broader enterprise AI talent pipeline. UK plan brief
- A 23k+‑GPU Nscale system positions the UK for large‑scale model training and advanced inference workloads, easing cross‑region capacity constraints for multinationals. UK plan brief
Washington Post moves 1.79B tokens/mo on Together’s Dedicated Endpoints with ~2s latency
The Washington Post standardized “Ask The Post AI” on Together AI’s Dedicated Endpoints, processing 1.79 billion tokens per month with consistent ~2‑second responses and fixed monthly pricing. The posture is open‑model (Llama/Mistral) with full control and zero proprietary API lock‑in. case study post
- Fixed‑price, predictable ops while scaling to newsroom traffic peaks; details in the customer story. results summary How The Post did it
- Dedicated endpoints keep model ownership and tuning in‑house versus fully managed black‑box APIs. customer quote
- Architecture choice targets operational reliability (2s responses) plus governance over data and models, relevant for regulated media and enterprises. case study post
Gamma launches automation API for content creation and adds Business/Ultra pricing tiers
Gamma introduced an API to fully automate deck/blog/social content creation from internal data and workflows, alongside new Business and Ultra plans. It’s aimed at integrating into automation tools or direct data sources to scale marketing and comms ops. api summary
- Example flows: turn meeting notes into decks plus follow‑up emails; auto‑create/schedule LinkedIn carousels. use cases
- New plans: Business (team features, branding, priority support) and Ultra (unlimited premium features). new plans
- Pitch: position as a visual‑communication platform that can now be fully automated via API for enterprise teams. positioning Gamma site
🦾 Home humanoids and field trials
Robotics capital and deployments: 1X reported $1B raise target for soft‑sided home humanoid NEO Gamma; Unitree G1 "Anti‑Gravity" mode update demos.
1X seeks $1B to scale soft‑sided home humanoid NEO Gamma at ~$10B valuation
Humanoid startup 1X is targeting up to $1B in new funding at a ~$10B valuation to ramp NEO Gamma (home) and expand EVE (industrial/security), aiming for broad home rollout by December 2025. The stack blends OpenAI voice with a 160M‑parameter Redwood VLM controller, combining teleop with RL and a world‑model for safer autonomy. funding thread
- Early fielding plan: “a few hundred to a few thousand” NEO units in homes by late‑2025, following SF‑area trials; EVE previously deployed 150–250 units in U.S. night‑guard pilots via ADT’s Everon. deployment recap
- Soft exterior, tendon‑driven arms, kneeling/sitting mechanics for tight, human‑proximate spaces; focus on tidying, deep cleaning, conversational help. funding thread
- Control stack: OpenAI voice commands → Redwood controller (160M, 5Hz onboard) → motion plans; teleoperation catches edge cases and feeds training data to improve autonomy. deployment recap
- Safety and reliability: remote assist plus a predictive world model that evaluates outcomes before executing on hardware; mobility RL covers stairs, kneeling, sitting, stand‑ups. deployment recap
- Manufacturing and scale: Palo Alto HQ sized for ~400 staff; in‑house actuators and assembly in Moss, Norway; peers raising aggressively (e.g., Figure). funding thread