Alibaba Qwen3 Max GA – Intelligence Index 55 and 62.4% SWE‑Bench


Executive Summary

Alibaba’s Qwen3 Max graduates to general availability with measurable gains where it counts. The Intelligence Index rises 6 points to 55, tool use on τ² Telecom jumps from 33% to 74%, and coding lifts show up on LiveCodeBench and SWE‑Bench. Pricing lands at $1.20/$6.00 per 1M input/output tokens with a 256k‑token context—an aggressive price/performance bid against frontier peers.

In numbers:

  • SWE‑Bench Verified: 62.4% (#4); $1.66 cost per test; 2,804s median latency
  • Intelligence Index: 55 (+6 points) per Artificial Analysis’ latest report
  • LiveCodeBench: 77% accuracy, up from 65% in prior snapshots
  • Tool use: τ² Telecom climbs from 33% to 74%
  • Context window: 256k tokens; text‑only model configuration
  • Pricing: $1.2/M input tokens; $6/M output tokens
  • Availability: Qwen Chat and Alibaba Cloud; public comparisons and token‑use analytics live

Also:

  • Ollama Cloud: Kimi‑K2 1T and DeepSeek V3.1 671B endpoints with free trials
  • Gemini 2.5 Flash: ranks #3/38 on MMMU and #6/20 on SWE‑Bench; roughly half the cost of peers

Feature Spotlight

Feature: AI feeds go proactive (Pulse, Vibes)

Proactive AI feeds arrive: ChatGPT Pulse and Meta’s Vibes push assistants from chat to daily, personalized streams—raising UX, privacy and monetization stakes for AI apps.

Cross‑account chatter focused on ChatGPT Pulse day‑2 reactions and Meta’s Vibes web rollout (incl. EU). Mostly product takes on proactive assistants, personalization, and likely ads. This is separate from model/eval updates elsewhere.



🗞️ Feature: AI feeds go proactive (Pulse, Vibes)

Cross‑account chatter focused on ChatGPT Pulse day‑2 reactions and Meta’s Vibes web rollout (incl. EU). Mostly product takes on proactive assistants, personalization, and likely ads. This is separate from model/eval updates elsewhere.

ChatGPT Pulse day‑2: mixed reviews on depth, sources, and UX

Early Pro users say the new proactive feed feels shallow and work‑skewed, with limited scroll and some source gaps, while others already see useful, well‑formatted cards for daily follow‑ups. See initial impressions in user review, feature notes in feature note, and concrete cards in work cards.

Pulse work cards

  • Substack content is currently blocked by the in‑app browser, hampering certain cards; users are asked to paste text instead substack block.
  • Several want a podcast‑style read‑aloud to turn the morning brief into a commute companion feature note.
  • Some report push notification or refresh bugs on mobile, requiring app restarts to view the latest Pulse user review.
  • Not all feedback is negative—power users praise the “proactive ephemeral UI” when Pulse composes ready‑to‑send drafts from connected data positive take.

Meta’s Vibes AI video feed goes live on the web, including EU

The Vibes feed—short AI‑generated videos with remix tools and cross‑posting—has started rolling out on the web (EU included), following last night’s preview. Users can browse clips, personalize, and jump straight into creating or remixing from the feed itself per early web captures in vibes modal and feed preview, with feature rundowns in feature summary and confirmations in web link.

Vibes modal

  • Feed positions creation as the next step: start from scratch, remix visuals/music/styles, then share to DMs, Stories/Reels feature summary.
  • Rollout notes point to partners like Midjourney/BFL powering generation, plus creator‑focused personalization tools coming to select accounts vibes modal.

Debate flares: Are AI feeds ‘creation tools’ or ad‑ready content streams?

A lively back‑and‑forth emerged around framing and monetization: some argue Pulse/Vibes should be sold as creation workflows, not algorithmic feeds, while others warn Pulse could become an ad surface quickly. See ad concerns in ads concern and replies in user response; framing critiques in messaging take.

Pulse ad concern

  • Critics suggest calling these “create/collaborate” canvases would land better than “feeds,” which implies passive consumption messaging take.
  • Pulse’s work‑centric cards led some to compare it to human‑curated timelines and question whether assistant‑written feeds will win attention long‑term attention debate.

xAI’s Grok eyed as X’s AI curator, with voice‑first app and translate assists

As OpenAI and Meta push assistant‑led feeds, xAI is positioning Grok to curate the X timeline, pairing a voice‑first mobile app with features like Autotranslate that broaden what users can consume. See the curation angle in attention debate, the voice UI in voice UI, and translation utility callouts in grok translate.

Grok voice UI

  • Proponents argue assistant‑led curation inside an existing social graph could outperform standalone assistant feeds by marrying discovery with conversations grok translate.

🧰 Agentic coding and dev tooling

New CLI/IDE agents, sub‑agents, local stacks and security modes. Today adds Warp CLI, Factory subagents, Codex 0.42 secure mode, Cline 1M‑ctx stealth model. Excludes MCP ‘Code Mode’ (see orchestration).

Claude Code introduces subagents that act like a coordinated dev team

Anthropic rolled out subagents in Claude Code, with roles like debugger, tester, and refiner collaborating in sequence on tasks feature clip.

Claude Code api‑key prompt

  • The pattern encourages narrow expertise per agent and predictable handoffs for complex fixes and test loops feature clip.
  • Complements the recent slash‑command workflows by turning multi‑step sessions into reliable, repeatable pipelines for teams feature clip.

Cline’s stealth model gets a 1M‑token context window for giant repos

Cline unveiled code‑supernova‑1‑million, expanding its stealth model’s context from 200k to 1M tokens while keeping multimodal support and free alpha access blog post. This arrives after the v3.31 update trimmed the UX and added voice and YOLO auto‑approve.

  • Targets monorepos, cross‑service refactors, long log debugging, and doc‑heavy sessions that previously blew past limits release blog.
  • Minimizes context pruning and re‑indexing overhead during long‑running agent sessions blog post.

Factory CLI adds subagents/custom Droids for repeatable coding workflows

Factory now lets you define subagents (“custom Droids”) that specialize and hand off work, turning common tasks into one‑command flows feature brief.

Factory terminal run

  • Built‑in recipes span commit‑to‑notes, code security audits, support docs, UI drafts, and multi‑step debugging feature brief.
  • Early users report stronger terminal automation with pre‑packaged task graphs and guardrails announcement, and live testing streams are planned this weekend live demo.

Warp ships a CLI to run its agents from any terminal, remote host, or CI

Warp introduced a CLI that decouples its agent runtime from the Warp app, enabling custom integrations, remote/CI runs, and Slack hooks release post.

  • Supports agent profiles, prompts, and MCP servers via subcommands; install and usage details are in the docs CLI docs.
  • Opens the door to standardized agent ops (repeatable runs, logs, exit codes) in existing developer toolchains without changing shells release post.

Codex 0.42.0 adds secure mode, safer exec, and a revamped /status

The latest Codex CLI release ships CODEX_SECURE_MODE=1, safer command execution and clearer exec events, plus a more informative /status readout release notes.

Release notes screenshot

  • Early Rust SDK‑based MCP client lands in preview, pointing to tighter tool integration release notes.
  • Security‑first toggles and better visibility reduce foot‑guns when letting agents touch the shell or external tools release notes.

Local stack: Cline + LM Studio + Qwen3 Coder 30B on Apple Silicon

A step‑by‑step guide shows Cline driving LM Studio with Qwen3 Coder 30B for a fully local coding agent—no cloud keys, offline, and optimized for MLX blog post.

  • Recommends 262k context, disabling KV cache quantization, and using Cline’s compact prompt for local efficiency blog post.
  • Offers a practical path for Mac developers to prototype agentic coding without inference bills or data egress local setup.
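
As a quick sanity check before pointing Cline at the server, LM Studio's OpenAI‑compatible endpoint (default http://localhost:1234/v1) can be exercised directly; the model id below is an assumption, so use whatever identifier LM Studio shows for the loaded Qwen3 Coder build.

```python
# Minimal smoke test for a local LM Studio server (OpenAI-compatible API,
# default http://localhost:1234/v1). The model id is hypothetical — use the
# identifier LM Studio reports for your Qwen3 Coder 30B load.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3-coder-30b",  # assumption; check LM Studio's model list
    messages=[{"role": "user", "content": "Print 'hello' in Python, one line."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```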

🧩 MCP in production (Code Mode and content ops)

Interoperability and server patterns. New ‘Code Mode’ proposals translate MCP specs to TypeScript APIs for agent‑written code. Cloudflare teases site MCP servers via AI Index. Excludes general coding tools covered elsewhere.

Cloudflare teases AI Index: built‑in MCP servers for your site

Cloudflare previewed an AI Index feature that auto‑exposes a website as an MCP server, complete with NLWeb tools for natural‑language queries; agentic apps could connect directly to your content with no custom glue feature screenshot. This lands in context of crawler license, which let millions of sites control AI access without breaking search.

AI Index MCP note

  • What’s shipping: when your AI Index is set up, you “get a set of ready‑to‑use APIs” including an MCP server that standardizes agent access to your site feature screenshot, Cloudflare blog post.
  • Why it matters: content ops move from bespoke scrapers to a standard MCP surface, reducing integration drift and enabling safer, auditable agent access paths Cloudflare blog post.
  • Adoption sentiment: builders expect MCP to “explode” when it’s as easy as Zapier—platform‑level provisioning is a big step toward that adoption comment.

MCP Code Mode turns tools into TypeScript APIs

Instead of letting models call MCP tools directly, Code Mode wraps those tools behind a TypeScript API; the model writes code, the agent runs it in a sandbox, and the code calls MCP. Teams report fewer schema errors, lower token usage, and easier review/reuse compared with raw tool-calling Cloudflare brief, and the first-party write‑up walks through the pattern end‑to‑end Cloudflare blog post.

Code Mode flow

  • The loop: expose MCP as typed functions, have the LLM author a small script, then execute it to orchestrate multi‑step/conditional logic without ballooning tokens Cloudflare brief.
  • Practitioner reaction highlights a practical upside—models are better at writing code than emitting JSON tool schemas—so code becomes the lingua franca for orchestration @mattshumer comment.
  • Nuanced critique: if your client code gets too complex, that’s a signal to redesign the MCP server around higher‑level functionality, not raw API passthroughs design critique.
  • Review benefits: code diffs are auditable, reusable across runs, and can batch/retry intelligently (loops, guards) without tool‑call chatter Cloudflare blog post.
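
As a rough illustration of the pattern, here is a minimal sketch of the Code Mode loop. Cloudflare's write‑up generates TypeScript; this version uses Python with a stubbed MCP call purely to show the shape: tools become typed functions, the model authors a short script, and the agent runs it against a narrow namespace standing in for a real sandbox.

```python
"""Sketch of the Code Mode pattern: MCP tools wrapped as typed functions, with
a model-written script executed against only those functions. `call_mcp_tool`
is a stand-in for a real MCP client round-trip."""
from typing import Any, TypedDict


class Ticket(TypedDict):
    id: str
    status: str


def call_mcp_tool(name: str, args: dict[str, Any]) -> Any:
    # Stand-in for a real MCP client; returns canned data for illustration.
    fake_db = [{"id": "LEGACY-1", "status": "open"}, {"id": "NEW-7", "status": "open"}]
    if name == "tickets/list":
        return [t for t in fake_db if t["status"] == args["status"]]
    if name == "tickets/close":
        return {"id": args["id"], "status": "closed"}
    raise ValueError(f"unknown tool {name}")


# Typed wrappers are what the model "sees" instead of raw JSON tool schemas.
def list_tickets(status: str) -> list[Ticket]:
    return call_mcp_tool("tickets/list", {"status": status})


def close_ticket(ticket_id: str) -> Ticket:
    return call_mcp_tool("tickets/close", {"id": ticket_id})


# A short script the LLM might author: multi-step, conditional logic stays in
# code rather than in repeated tool-call round-trips.
MODEL_SCRIPT = """
stale = [t for t in list_tickets("open") if t["id"].startswith("LEGACY-")]
for t in stale:
    close_ticket(t["id"])
result = len(stale)
"""


def run_in_sandbox(script: str) -> Any:
    # Real deployments use an isolated runtime; exec with a narrow namespace
    # only approximates the idea.
    scope: dict[str, Any] = {"list_tickets": list_tickets, "close_ticket": close_ticket}
    exec(script, scope)
    return scope.get("result")


if __name__ == "__main__":
    print(run_in_sandbox(MODEL_SCRIPT))  # -> 1
```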

🧪 New and upgraded models

Mostly media and platform adds plus one major eval lift. Today: ByteDance Lynx I2V on fal, Tencent Hunyuan3D‑Part/Omni releases, Qwen3 Max GA with measured uplifts, Ollama cloud adds Kimi K2 (1T) and DeepSeek V3.1 cloud endpoints.

Qwen3 Max hits GA with measurable uplifts across tool use, coding, and long context

Alibaba’s Qwen3 Max moved to GA with a +6‑point jump to 55 on Artificial Analysis’ Intelligence Index, major gains in tool use (τ² Telecom 33%→74%), LiveCodeBench (65%→77%), and AA‑LCR for long‑context reasoning (40%→47%), following arena additions that surfaced head‑to‑head access. See the breakdown in index report and SWE‑Bench snapshot eval table.

SWE‑Bench table

  • Ranked #4 on SWE‑Bench Verified (62.4%) and available in Qwen Chat and Alibaba Cloud eval table, index report
  • 256k context, text‑only, priced at $1.2/$6 per 1M in/out tokens index report
  • Model page with current comparisons and token usage analysis model page
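
At the listed rates, per‑request cost is simple arithmetic; the request profile below is hypothetical.

```python
# Back-of-envelope cost at the listed $1.20 / $6.00 per 1M input/output tokens.
# The request profile (40k in, 4k out) is a made-up example.
IN_PRICE, OUT_PRICE = 1.20, 6.00           # USD per 1M tokens
tokens_in, tokens_out = 40_000, 4_000
cost = tokens_in / 1e6 * IN_PRICE + tokens_out / 1e6 * OUT_PRICE
print(f"${cost:.3f} per request")           # -> $0.072 per request
```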

ByteDance Lynx hits fal with single‑photo→video and strong identity lock

A new Lynx endpoint on fal generates smooth video from just one portrait while preserving facial identity—priced at $0.60 per second and ready instantly (no LoRA). See the rollout in release note and try it on the fal model page.

  • One‑photo reference to video with stable identity and motion continuity release note
  • Instant usage via fal playground/API; no personalization step required fal model page
  • Billed at $0.60 per second of output video pricing mention
  • Early dev chatter highlights portrait‑to‑scene consistency across contexts platform update

Tencent open‑sources 3D part modeling and controllable 3D asset generation

Tencent Hunyuan unveiled two complementary 3D systems: Hunyuan3D‑Part (with P3‑SAM and X‑Part) for native part segmentation and diffusion‑based decomposition, and Hunyuan3D‑Omni for multi‑condition controlled asset creation. Code, weights, and demos are live release thread, GitHub repo.

  • P3‑SAM: first native 3D part segmentation trained on ~3.7M shapes with clean annotations (no 2D SAM during training) release thread, P3‑SAM paper
  • X‑Part: diffusion decomposition for structure‑coherent part generation and editing X‑Part paper
  • Omni “ControlNet of 3D”: fuse up to four controls (skeleton, point cloud, bbox, voxel) to fix occlusions and geometry omni thread
  • Inference code and weights released to accelerate research and production workflows omni thread, GitHub repo

Ollama lights up cloud runs for Kimi‑K2 1T and DeepSeek V3.1 671B

Developers can now invoke frontier‑scale models via simple ollama run commands—kimi‑k2:1t‑cloud and deepseek‑v3.1:671b‑cloud—available to try for free during the rollout cloud models. Grab the desktop app from the download page to get started.

Ollama download

  • One‑line runs: ‘ollama run kimi-k2:1t-cloud’ and ‘ollama run deepseek-v3.1:671b-cloud’ cloud models
  • Positions local workflows to tap massive context/capabilities without self‑hosting multi‑GPU clusters cloud models
  • Windows/macOS downloads and setup guidance are live download page
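
For programmatic use, the same cloud tags should be reachable through the Ollama Python client once the local app is running and signed in; this is a hedged sketch, since exact behavior during the free preview may vary.

```python
# Hedged sketch: calling one of the new cloud-hosted tags via the Ollama Python
# client (pip install ollama). Assumes the local Ollama app is running and the
# account has cloud access during the free preview.
import ollama

resp = ollama.chat(
    model="kimi-k2:1t-cloud",
    messages=[{"role": "user", "content": "In one sentence, what is a mixture-of-experts model?"}],
)
print(resp["message"]["content"])
```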

📊 Real-world evals and leaderboard shifts

New results and meta‑debates: Gemini 2.5 Flash gains, ARC PRIZE score jump, Qwen3 Max SWE‑Bench table, browser WebBench claims, plus criticism of Terminal‑Bench task realism. Excludes GDPval (covered yesterday).

Gemini 2.5 Flash posts fresh third‑party gains and ranks, at roughly half the cost of peers

Vals.ai reports +17.2% on GPQA and +5% on TerminalBench for gemini‑2.5‑flash‑09‑2025, ranking #3/38 on MMMU and #6/20 on SWE‑Bench, while pricing stays about half of similarly performing models evals thread, bench delta, pricing note. In context of Flash gains, this adds broader leaderboard placement beyond the initial token‑efficiency claims.

flash update card

  • Flash vs o3/Claude peers: Vals.ai snapshots place Flash’s MMMU rank at #3/38 and SWE‑Bench at #6/20, indicating competitive multi‑modal and coding performance evals thread.
  • Cost profile: Vals.ai calls Flash “half the cost” of comparable models; their note cites $0.30 input / $2.50 output per 1M tokens for Flash pricing note.
  • Flash vs Flash Lite: On private legal/finance sets (CaseLaw, TaxEval, MortgageTax), Flash outperforms Lite by ≈10% while the two are similar on many public benches private set gap.

ARC Prize 2025 score jumps to 27.08% as Giotto.ai tops the board

Giotto.ai set a new high score of 27.08% on ARC‑AGI 2025, a sizable leap over prior entries and a fresh signal of progress on abstract reasoning tasks scoreboard.

arc prize board

  • The MIT event showcasing ARC‑AGI‑3 and interactive benchmarks kicks off Sept 27; recording will be posted to the ARC Prize YouTube channel event details, recording info, event page.

Qwen3 Max lands 62.4% on SWE‑Bench Verified with detailed cost/latency snapshot

Vals.ai’s table shows Qwen3 Max at 62.4% on SWE‑Bench Verified, with $1.66 cost/test and ~2,804s median latency—placing it fourth behind frontier reasoning models while offering a clear price/perf trade‑off swe‑bench table.

swe‑bench table

  • Broader context: Artificial Analysis also bumped Qwen3 Max to 55 on their Intelligence Index (+6 points), with visible gains in tool use and coding (LCB 65%→77%) intelligence roundup, complementing the SWE‑Bench readout.

Browser WebBench: Gemini 2.5 Flash matches o3 accuracy while 2× faster and ~4× cheaper

On a 200‑task WebBench‑Read run, gemini‑2.5‑flash‑09‑2025 posted 82.6% accuracy vs o3 at 82.4%, finishing in 105s vs 199s and at ~€5.7 vs €24.3 average cost per task, per the shared comparison chart webbench chart.

webbench bars

  • Caveat: This is a single benchmark slice; results may vary across other agent/browser suites and settings.

Terminal‑Bench realism questioned: “too many contrived tasks,” calls to fix agent evals

Practitioners argue Terminal‑Bench over‑indexes on toy puzzles (e.g., trivial servers) and contrived workflows, making it a negative signal for production agent quality; calls grow for harder, real‑world tasks and better harnesses skeptic take, tasks repo, benchmark rant.

benchmark screenshot


🎬 Video/edit stacks in the wild

Hands‑on pipelines and comparisons dominate: Wan 2.2 Animate character replacement tests across vendors; Kling 2.5 vs Luma Ray3 takes; ComfyUI nodes and showcases. Excludes Veo‑3 research (see research).

ByteDance Lynx lands on fal: single-photo → video with strong ID retention at $0.60/sec

A new fal deployment of ByteDance’s Lynx can animate a single portrait into smooth video without LoRA, while preserving identity—priced at $0.60 per generated second fal announcement, fal model page.

  • Designed for “single photo reference to video,” with motion that keeps facial details intact across scenes fal announcement
  • Instant to try (playground/API), no fine‑tuning required; useful for quick creator pipelines and ad comps try it today
  • The cost model encourages short, iterative revisions; teams can front‑load prompts and dial style before rendering longer takes fal model page

WAN 2.2 Animate: native site and Alibaba ‘pro’ beat fal/Replicate in user tests

Hands-on trials of WAN 2.2 Animate across providers point to better identity retention and cleaner color handling on the native wan.video service and Alibaba’s “pro” setting, with fal (480p) and Replicate trailing on visual fidelity. See the multi-post comparison in wan comparison.

  • Alibaba Wan “pro” delivered the cleanest, colorized replacement and the most consistent face across frames alibaba pro result
  • Replicate’s output was higher-res but oddly left the subject in black-and-white against color backgrounds, hurting realism replicate result
  • fal at 480p produced the weakest look in this run; author suspects resolution contributed to artifacts fal sample
  • Conclusion from the thread: provider and settings matter as much as the base model; try native/pro tiers first for identity‑critical edits wan comparison

ComfyUI hosts Wan 2.2 Animate showcase; Qwen Image Edit 2509 ships with native support

ComfyUI is spotlighting WAN 2.2 Animate with a live session featuring Ingi Erlingsson and Lovis Odin, alongside native support for WAN 2.2 Animate and Qwen-Image-Edit‑2509 announced in the latest blog showcase invite, feature blog. This comes in context of native support, where ComfyUI detailed broader video/image pipelines.

event promo

  • Live event and replay links are available; expect character replacement and motion‑control demos live replay
  • Feature post outlines dual WAN 2.2 modes (animate vs replacement), long‑clip stability, and “replace a character in scene” workflows blog post
  • Qwen‑Image‑Edit‑2509 adds multi‑image editing, identity/product preservation, and ControlNet depth/edge/keypoint integration feature blog

Kling 2.5 Turbo vs Luma Ray 3: quick side-by-side suggests Kling edge on realism

A fast user comparison finds Kling 2.5 Turbo edging Luma Ray 3.0 on perceived realism in matched prompts quick comparison. While anecdotal, it aligns with creator workflows trending toward Kling in longer roundups.

  • Community guides pair Kling with Seedream 4 for cinematic comping and stylization workflow guide
  • Weekly recaps show an uptick in Kling 2.5 creator reels and edits circulating in production‑style stacks community roundup

🔎 RAG systems and reasoning search

Quiet on new APIs; focus on retrieval quality and research‑ops. Today: Perplexity “Sonar Testing” reasoning toggle sighting, Mistral Libraries add website indexing for Le Chat, and office hours notes on metadata‑aware RAG.

Perplexity trials “Sonar Testing” with a reasoning toggle for deeper answers

Spotted in the wild, Perplexity is testing a Sonar variant with a reasoning mode designed to tackle tougher queries while keeping its trademark speed for quick lookups model sighting, feature writeup. The UI exposes a simple toggle, hinting at an upcoming path where users choose between instant snippets and slower, chain‑of‑thought style responses.

search quality charts

  • In context of initial launch, this builds on the new Search API by adding a depth control at the model layer rather than just retrieval tuning model sighting, feature writeup.
  • Early chatter frames this as a Sonar upgrade, not a separate product; the emphasis is accuracy, instruction following, and larger contexts without sacrificing low P50 latency feature writeup.
  • Developer interest is immediate; teams are wiring Perplexity search into agent tools already (example TypeScript tool for opencode) tool example, gist code.
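
For teams wiring this in today, the public route is Perplexity's OpenAI‑compatible Sonar API; a minimal Python version of the tool idea might look like the sketch below. The "sonar" model name reflects the current public API, while the reasoning‑toggle variant from the sighting is not a documented parameter.

```python
# Sketch of a web-search tool backed by Perplexity's OpenAI-compatible API.
# The "sonar" model name is the current public API model; the reasoning-toggle
# variant spotted in the UI is not (yet) a documented option.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["PERPLEXITY_API_KEY"],
                base_url="https://api.perplexity.ai")

def web_search(query: str) -> str:
    """Return a sourced answer for an agent to cite or post-process."""
    resp = client.chat.completions.create(
        model="sonar",
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

print(web_search("What changed in the latest Perplexity Search API release?"))
```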

Classifier‑as‑Retriever beats MIPS: 99.95% top‑1 with soft embeddings and FL+DP

Freezing a small LM, learning “soft embeddings” via a tiny adapter, and replacing dot‑product retrieval with a classifier head yields 99.95% top‑1 on a spam task—crushing a frozen‑encoder+MIPS baseline at 12.36% overview thread, results summary. The approach also plays well with federated learning and differential privacy, trading full fine‑tuning for tiny, sharable parameters.

architecture sketch

  • Recipe: Keep tokenizer+encoder frozen; insert a small adapter to reshape token embeddings; train a classifier that maps query states to document labels (retrieval is learned, not fixed) design picture, classifier detail.
  • Results: Adapter‑only MIPS → 96.79%; classifier‑only → 97.02%; adapter+classifier → 99.95% top‑1; federated setup hits up to 2.62× speedup while matching centralized accuracy results summary, federated speedups.
  • Privacy: clip updates and add Gaussian noise to adapter/classifier gradients; small accuracy hit while maintaining DP guarantees privacy note.
  • Read more: implementation details and ablations in the blog and paper company blog, ArXiv paper.
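
A conceptual PyTorch sketch of the recipe follows (frozen encoder, tiny trainable adapter on token embeddings, classifier head over document labels); shapes and modules are illustrative, not webAI's implementation.

```python
# Conceptual sketch of classifier-as-retriever: freeze the encoder, train only
# a small adapter over token embeddings plus a classifier head that maps query
# states to document labels. Toy modules and sizes, not the paper's setup.
import torch
import torch.nn as nn

VOCAB, D, N_DOCS = 1000, 64, 50

class SoftEmbeddingRetriever(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)              # frozen
        self.encoder = nn.GRU(D, D, batch_first=True)    # frozen
        for p in list(self.embed.parameters()) + list(self.encoder.parameters()):
            p.requires_grad = False
        # Trainable pieces: adapter reshapes token embeddings, classifier
        # turns retrieval into label prediction.
        self.adapter = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
        self.classifier = nn.Linear(D, N_DOCS)

    def forward(self, token_ids):
        x = self.adapter(self.embed(token_ids))   # "soft embeddings"
        _, h = self.encoder(x)                    # frozen contextualization
        return self.classifier(h[-1])             # logits over document labels

model = SoftEmbeddingRetriever()
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One toy training step: queries (token ids) labeled with their target document.
queries = torch.randint(0, VOCAB, (8, 12))
doc_labels = torch.randint(0, N_DOCS, (8,))
loss = loss_fn(model(queries), doc_labels)
loss.backward()
opt.step()
print(f"toy step loss: {loss.item():.3f}")
```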

DeepMind: episodic memory fixes “latent learning” gaps that weights miss

Instead of jamming more facts into model weights, DeepMind shows that saving past episodes and retrieving them at inference lets models flexibly reuse experiences—reducing reversal failures (e.g., turning “X is Y’s son” into “Y is X’s parent”) without task‑specific training paper summary.

paper title page

  • Latent learning: systems notice useful info during training but fail to keep it accessible; retrieval supplies missing specifics for language, codebooks, and maze navigation paper summary.
  • Technique: fetch a few similar episodes; place them in context; leverage in‑context learning to apply the right path/facts for the new query paper summary.
  • Complementarity: broad patterns remain parametric; rare or directional facts ride in episodic memory—clean fit with modern RAG stacks paper summary.

RAG office hours: blend dense vectors with sparse/structured signals, then route

Treating metadata (timestamps, authorship, relationships) as first‑class signals—alongside dense vectors—materially lifts retrieval quality. The guidance: build evals, segment queries, fine‑tune embeddings per segment, and route to specialized retrievers rather than one‑size‑fits‑all office hour notes.

framework slide

  • Course notes stress a “RAG flywheel”: synthetic eval data → embedding fine‑tuning → feedback UX → query segmentation → multimodal indices → smart routing course page, slides image.
  • Key pitfall is stringifying structure; instead, keep sparse/structured features (BM25, facets, schema fields) distinct so rerankers can learn their weight office hour notes.
  • Practical takeaway: define failure‑focused evals (hard negatives, long‑tail intents) before iterating retrievers—otherwise dot‑product wins by default course page.
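
A toy illustration of keeping the signals distinct and fusing them late: a dense ranking and a sparse/metadata ranking combined with reciprocal rank fusion, plus a sketch of segment routing. Documents, orders, and route names are made up.

```python
# Late fusion of distinct signals: rather than stringifying metadata into the
# embedded text, keep a dense (vector) ranking and a sparse/metadata ranking
# separate and merge them with reciprocal rank fusion.
def rrf(rank_lists, k=60):
    scores = {}
    for ranks in rank_lists:
        for pos, doc_id in enumerate(ranks):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + pos + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking  = ["doc_recent", "doc_old", "doc_author_match"]   # vector similarity order
sparse_ranking = ["doc_author_match", "doc_recent", "doc_old"]   # BM25 + metadata filters

print(rrf([dense_ranking, sparse_ranking]))  # fused order

# Segment-aware routing: different query segments hit different retriever
# stacks instead of one global index (names are illustrative).
ROUTES = {"navigational": "bm25_only", "exploratory": "dense_plus_rrf"}
```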

🚀 Serving, token efficiency and latency wins

Production tips and infra‑kernel work. New: SGLang best practices on H20 deployments; NVIDIA TRT‑LLM ‘Scaffolding’ with Dynasor token savings; Baseten cuts Superhuman embeddings P95 to 500ms.

SGLang + Ant Group hit 16.5k/5.7k tok/s per H20 node with a hardened DeepSeek‑R1 serve recipe

Ant Group and LMSYS published best practices for serving DeepSeek‑R1 on H20‑96G GPUs, delivering 16.5k input and 5.7k output tokens per second per node, with a hardware‑aware split and kernel upgrades blog post, and LMSYS blog. This lands as a practical continuation of their performance work on newer silicon, following up on GB200 speeds where they reported 26k/13k per GB200.

serving best practices

  • Topology: TP‑8 for prefill and small‑scale EP‑16 for decode to balance TTFT and resilience blog post.
  • Kernels: FP8 FlashMLA and SwapAB GEMM push matrix ops throughput on H20 blog post.
  • Scheduler: Single‑Batch Overlap improves throughput; Expert Affinity load balancing stabilizes MoE traffic blog post.
  • Ops: Lightweight observability and tiered SLA deployment patterns meant for real production use best practices link.
  • Why it matters: Demonstrates a repeatable path to strong R1 serving throughput on export‑compliant H20s, not just top‑end GB200s blog post.

TensorRT‑LLM ‘Scaffolding’ integrates Dynasor to cut reasoning tokens by up to 29%

NVIDIA’s new inference‑time compute framework in TensorRT‑LLM now supports Dynasor, a certainty‑probing method that trims chain‑of‑thought generation, reporting up to 29% token reduction with no accuracy loss; the method has a NeurIPS’25 acceptance NV integration, TRT‑LLM blog, and arXiv paper.

  • Idea: Probe mid‑reasoning to detect stabilization and early‑exit extra thinking tokens; applies across o1/o3‑style reasoners and MoE stacks Dynasor blog.
  • Ecosystem: Packaged alongside TensorRT‑LLM “Scaffolding” to schedule inference‑time compute, improving cost and throughput at serve TRT‑LLM blog.
  • Availability: Reference code and servers provided via vLLM and OpenAI‑compatible endpoints for drop‑in trials Dynasor repo.
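
Conceptually, the certainty probe is a small loop: generate reasoning in chunks, periodically ask for a provisional answer, and stop once consecutive probes agree. The sketch below is a generic stand‑in with a stubbed model call, not the TensorRT‑LLM integration.

```python
# Rough sketch of certainty probing: interleave reasoning chunks with cheap
# "final answer" probes and exit early once the answer stabilizes. `generate`
# is a stub; a real setup calls the serving endpoint.
def generate(prompt: str, max_tokens: int) -> str:
    return "... partial reasoning ... answer: 42"   # stand-in for an LLM call

def extract_answer(text: str) -> str:
    return text.rsplit("answer:", 1)[-1].strip()

def solve_with_probes(question: str, chunk: int = 256, max_chunks: int = 8, agree: int = 2):
    transcript, history = question, []
    for _ in range(max_chunks):
        transcript += generate(transcript, max_tokens=chunk)               # keep reasoning
        probe = generate(transcript + "\nFinal answer:", max_tokens=16)    # cheap probe
        history.append(extract_answer(probe))
        if len(history) >= agree and len(set(history[-agree:])) == 1:
            return history[-1]                                             # stabilized: exit early
    return history[-1]

print(solve_with_probes("What is 6 * 7?"))
```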

Superhuman cuts embedding P95 latency 80% to 500 ms with Baseten’s BEI runtime

Email app Superhuman reports an 80% reduction in P95 embedding latency down to 500 ms after moving to Baseten Embeddings Inference (BEI) with a performance client and multi‑cloud capacity management customer announcement, and case study page.

case study graphic

  • Stack: TensorRT‑LLM‑based BEI runtime for embeddings/re‑rankers/classifiers; per‑region elastic scaling via MCM case study page.
  • Rollout speed: Migration completed in about a week while maintaining feature parity for search, classification, and retrieval case study page.
  • Outcome: Lower tail latencies and higher throughput without growing the infra team; illustrates practical latency wins beyond model swaps customer announcement.

🛡️ Agent security and governance updates

Mostly concrete exploit/mitigation news. Today centers on Salesforce Agentforce prompt‑injection exfiltration (CVSS 9.4) and Trusted URL allowlists. Excludes xAI lawsuit (repeats prior day).

Salesforce locks down Agentforce after CVSS 9.4 prompt‑injection exfiltration; Trusted URL allowlists enforced

A CVSS 9.4 “ForcedLeak” flaw in Salesforce Agentforce allowed indirect prompt injection to exfiltrate CRM data via crafted image URLs, exploiting an expired allow‑listed domain, according to Noma Labs and a detailed follow‑up analysis Noma blog, research blog, blog writeup, and analysis post. Salesforce is now enforcing Trusted URL allowlists for Agentforce and Einstein Generative AI agents (effective Sept 8, 2025) to block outbound exfiltration channels Salesforce policy, Salesforce advisory. This mirrors the broader “lethal trifecta” pattern—private data + untrusted content + external comms—where cutting any leg (e.g., egress) sharply reduces risk trifecta post, blog post.

  • Attack path: Indirect prompt injection in Web‑to‑Lead data instructed the agent to embed leaked records in an <img> URL to a domain that had been (mistakenly) allow‑listed; the domain had expired and was re‑registered by researchers blog writeup.
  • Mitigation: Salesforce’s Trusted URL allowlists now restrict agent outputs from posting to untrusted destinations; customers should also tighten CSPs and sanitize untrusted inputs before agent consumption Salesforce policy.
  • Why it matters: Agent pipelines often combine sensitive context with untrusted inputs and have network egress—closing egress or isolating tool outputs breaks the exfiltration loop trifecta post.
  • In context of prompt injection (LinkedIn assistant hijack): another concrete case where untrusted content steers agents; today’s update shows vendors moving to enforce egress controls by default.
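
An egress guard in the spirit of a Trusted URL allowlist is straightforward to sketch: scan agent output for URLs (including image sources) and reject any host that is not explicitly allow‑listed. This is a generic illustration, not Salesforce's implementation.

```python
# Illustrative egress guard: block agent output that references URLs outside an
# explicit allowlist, cutting off the image-URL exfiltration channel described
# above. Hosts below are hypothetical.
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"assets.example-corp.com", "cdn.example-corp.com"}

URL_RE = re.compile(r"https?://[^\s\"'<>)]+")

def enforce_egress(agent_output: str) -> str:
    for url in URL_RE.findall(agent_output):
        host = urlparse(url).hostname or ""
        if host not in ALLOWED_HOSTS:
            raise ValueError(f"blocked egress to untrusted host: {host}")
    return agent_output

# A crafted <img> URL pointing at an attacker-controlled domain trips the guard.
try:
    enforce_egress('<img src="https://expired-domain.example/leak?d=crm-records">')
except ValueError as e:
    print(e)
```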

💼 Enterprise adoption and org shifts

Signals of spend and operating change. New today: Accenture to exit staff who can’t reskill for AI (with training scale/targets), Meta × Google ad‑model talks, and ElevenLabs agent ROI in industrial staffing.

Accenture to exit staff who can’t reskill for AI; 550k trained, 77k in AI/data roles

Accenture is accelerating an AI-first org reshuffle: CEO Julie Sweet said the company will “exit” employees who cannot quickly reskill for AI, after already cutting 11,000 roles in the last three months CNBC screenshot.

CNBC headline

  • Training scale: 550,000 employees trained on generative AI, with 77,000 now in AI and data roles CNBC screenshot.
  • Funding the shift: A six‑month $865M optimization plan and two divestitures target >$1B savings to reinvest in AI capabilities CNBC screenshot.
  • Revenue baseline: FY25 revenue was $69.7B (+7% y/y), attributed in part to client AI upskilling demand CNBC screenshot.
  • Operating implication: Expect faster rotation of talent into AI workflows and stricter performance baselines tied to AI productivity—procurement and partners should prepare for compressed ramp times and skills verification.

Meta explores using Google’s Gemini/Gemma to boost Facebook/Instagram ad targeting

Meta has held exploratory talks to evaluate Google’s Gemini and Gemma on Meta’s ad corpus to improve targeting and creative understanding, including proposals to fine‑tune models on platform‑specific ad signals The Information story, story link, deal summary.

Report headline

  • What’s on the table: General models parse ad text/captions/images into machine features that feed existing rankers; serving and auctions remain in‑house to control latency and data boundaries analysis thread.
  • Adoption signal: Highlights Meta’s willingness to augment in‑house AI with external models after internal scaling challenges, while keeping core ad systems proprietary The Information story, deal summary.
  • Market read: Mild reaction at report time (Alphabet up ~1% AH, Meta down ~0.5%), reflecting optionality rather than a done deal analysis thread.

ElevenLabs Agents scale Traba hiring: 50k+ interviews/month, 85% vetting automated

Industrial staffing platform Traba now runs >50,000 monthly voice interviews with ElevenLabs Agents, automating 85% of worker vetting and lifting shift completion rates by 15% for AI‑qualified workers versus human‑qualified cohorts deployment brief, results thread, ElevenLabs blog.

  • Throughput and savings: >4,000 operator hours saved per month via multi‑agent flows and multilingual interviews (English/Spanish) results thread, ElevenLabs blog.
  • Quality delta: Consistent rubric‑based evaluations across roles/shifts/regions correlate with higher completion rates post‑placement results thread.
  • Integration notes: Productionized with Twilio voice I/O and server tools; design mirrors call‑center QA at scale but with deterministic prompts and logging tutorials playlist, twilio integration.

🏗️ AI factories, power and partnerships

Infra narratives continue. Today: Together AI outlines ‘Frontier AI Factories’ stack; Groq becomes McLaren F1 Official Partner for inference; power demand commentary tied to AI DC growth.

Groq becomes McLaren F1 Official Partner to power real‑time inference and analytics

Groq signed on as an Official Partner of the McLaren Formula 1 Team, bringing its LPU‑based inference to support decision‑making, analysis, and real‑time insights; the Groq logo appears on cars starting at the Singapore GP. partnership post, and newsroom post

groq homepage

  • Partnership underscores inference‑at‑speed use cases where latency and cost are critical on and off the track. announcement
  • Groq positions its vertically‑integrated stack (LPU + GroqCloud) for low‑latency telemetry and workloads under race constraints. newsroom post
  • Signals growing sports/industrial demand for deterministic, fast inference pipelines over generic cloud GPU stacks. partnership post

US needs 100 GW of new firm power in five years as AI data centers drive demand

The US must add roughly 100 GW of firm capacity in five years, reversing a planned 80 GW reduction, as AI data centers fuel the first real electricity‑demand growth in years, per Chris Wright’s remarks power comment. This echoes the earlier capacity gap warning from Eric Schmidt of a ~92 GW shortfall.

Fortune headline

  • AI build‑out is shifting grid planning from retirements to rapid additions of reliable generation. power comment
  • Broader macro takes flag that AI capex is propping up growth while productivity gains lag, adding urgency to grid expansions. Fortune summary
  • Expect siting frictions, cost‑allocation debates, and utility rate cases to accelerate around AI‑linked interconnect queues. power comment

Together AI outlines ‘Frontier AI Factories’ stack: NVIDIA compute, InfiniBand and AI‑native storage

Together AI detailed its Frontier AI Factories design, emphasizing NVIDIA‑accelerated compute, high‑bandwidth InfiniBand networking, AI‑native storage, and a proprietary software stack built for scale, in a NYSE/TheCUBE segment. NYSE segment, stack outline, and watch live

  • The stack targets enterprise AI with end‑to‑end control: GPUs, fabric, storage, and orchestration tightly integrated. stack outline
  • Messaging focused on letting customers “own their AI” while accelerating model deployment on factory‑grade infra. watch live
  • Expect implications for cost, reliability, and time‑to‑market as more providers standardize this factory blueprint. NYSE segment

🎙️ Voice agents and telephony hooks

Smaller but practical: ElevenLabs expands agent deployments and SIP integrations. Today includes Twilio integration guidance and an ops case study. Excludes native audio core model updates from prior days.

Traba automates 50k+ interviews/month with ElevenLabs Agents; 85% vetting now fully automated

Industrial staffing platform Traba reports its AI interviewer “Scout,” built on ElevenLabs Agents, now conducts 50,000+ interviews per month—automating 85% of vetting and saving 4,000+ operator hours monthly customer rollout, results summary. Full details in the published case study case study.

  • Multilingual voice flows (English/Spanish) and a multi‑agent design lift consistency and outcomes customer rollout
  • +15% shift completion for AI‑qualified vs human‑qualified workers across warehousing/logistics/manufacturing roles results summary
  • Traba targets 100% automated vetting later this year; measurable ROI for AI ops leaders tracking staffing throughput results summary

ElevenLabs Agents publish Twilio call integration guides for production voice bots

ElevenLabs shared step‑by‑step guides to wire Agents to Twilio for inbound and outbound calling—so teams can stand up real phone numbers and SIP/PSTN flows—following up on initial launch of their multilingual Agents. See the walk‑throughs and examples in their tutorial playlist Twilio how‑to and the companion series on adding server tools agents tutorials, with details in the YouTube playlist.

  • Shows how to connect Agents to Twilio for inbound and outbound calls with minimal glue code Twilio how‑to
  • Covers adding server tools and production patterns (e.g., call routing, actions) for voice deployments agents tutorials
  • Useful for IVR, interviewers, and support bots where SIP/PSTN and call control matter Twilio how‑to

📜 Reasoning, training and multimodal research

Heavier research day: energy accounting, CUDA kernel robustness, episodic memory, soft‑embedding RAG, unified T2I/I2I understanding, and Gödel Test math proofs. Avoids bio/wet‑lab topics.

One video model performs 62 tasks zero‑shot, edging into visual reasoning

DeepMind reports a single video model (Veo 3) handling 62 tasks across 18,384 videos with instruction prompts alone—covering segmentation, edge detection, editing, physics intuition, and simple maze reasoning, without task‑specific training paper abstract, building on Chain‑of‑Frames, the prior zero‑shot video reasoning result.

Paper abstract

  • Matches or nears specialized editors on segmentation/edges; improves maze solving vs prior; still weak on depth/normals and some force‑following paper abstract.
  • Promptable, general model reduces the need for many per‑task vision models and simplifies workflows paper abstract.

Episodic memory fixes LLM ‘latent learning’ gaps in DeepMind study

DeepMind shows that retrieval‑augmented episodic memory lets models reuse prior experiences to solve reversals and other generalization failures that parametric weights miss paper summary. The work identifies “latent learning” as a root cause: models notice facts during training but don’t store them in a way that transfers.

Paper title page

  • Episodic memory fetches matching examples at inference, enabling in‑context reasoning to apply old knowledge to new asks (e.g., reversing relations) paper summary.
  • Without retrieval, reversals only work when the forward fact is present in context; with memory, models answer cold queries correctly paper summary.
  • Training on within‑example in‑context learning teaches the model to read and use retrieved memories effectively paper summary.
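
A toy version of the retrieve‑then‑prompt loop: store past episodes, fetch the nearest by similarity, and prepend them to the prompt so in‑context learning can reuse the stored fact. Bag‑of‑words counts stand in for a real encoder.

```python
# Toy episodic memory: retrieve the most similar stored episodes and place them
# in context so the model can answer e.g. a reversed relation. Bag-of-words
# "embeddings" are a stand-in for a learned encoder.
from collections import Counter
import math

EPISODES = [
    "Alice is Bob's parent.",
    "The maze exit is reached by going left, left, then right.",
]

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb + 1e-9)

def build_prompt(query: str, k: int = 1) -> str:
    q = embed(query)
    memories = sorted(EPISODES, key=lambda e: cosine(q, embed(e)), reverse=True)[:k]
    return "Relevant memories:\n" + "\n".join(memories) + f"\n\nQuestion: {query}\nAnswer:"

print(build_prompt("Who is Bob's parent?"))
```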

MANZANO unifies vision understanding and generation with a hybrid tokenizer

Apple introduces MANZANO, a unified multimodal LLM that reads and generates images using a single vision encoder with two lightweight adapters—continuous tokens for understanding and discrete tokens for generation—keeping both in one semantic space paper first page.

Paper title page

  • The LLM outputs text and image tokens; a diffusion decoder renders pixels, preserving strong perception while enabling high‑quality T2I paper first page.
  • Hybrid tokenizer beats pure discrete and dual‑encoder baselines; scaling the LLM (300M→30B) yields steady understanding and instruction‑following gains paper first page.
  • Design separates “meaning in the LLM” from “pixels in diffusion,” avoiding cross‑task interference common in unified stacks paper first page.

Synthetic bootstrapped pretraining recovers ~43% of 20×‑data gains

Apple and Stanford present Synthetic Bootstrapped Pretraining (SBP): learn document‑to‑document relations, then synthesize new texts from seeds to augment pretraining. On 3B models with 200B/1T tokens, SBP beats naive repetition and recovers ≈42–48% of the gains of a 20× larger unique dataset paper page.

Paper title

  • Pipeline: nearest‑neighbor pairing → doc‑from‑doc generator → joint training on real + synthetic; outputs generalize core concepts rather than copying paper page.
  • Gains show up in validation loss and QA; suggests a path for data‑constrained scaling without overfitting paper page.

Under “infinite compute,” regularization + ensembles + distillation yield 5.17× data efficiency

Stanford explores pretraining when compute is plentiful but data is scarce: heavy weight decay to curb memorization, train multiple members, then distill—achieving 5.17× data efficiency and retaining 83% of ensemble gains in a single student paper abstract.

Paper abstract

  • Tracks the achievable loss floor (with unlimited compute) to compare recipes fairly; ensembles lower this floor more than a single bigger model paper abstract.
  • Distillation compresses diverse patterns learned by members into one fast model while preserving most quality paper abstract.
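
A compact sketch of the ensemble‑then‑distill step on a toy regression problem; hyperparameters and setup are illustrative only, not the paper's recipe.

```python
# Toy ensemble-then-distill: train several small members with weight decay,
# average their predictions, then fit one student to the ensemble output.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)

def make_net():
    return nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

def fit(net, targets, weight_decay=0.0, steps=500):
    opt = torch.optim.Adam(net.parameters(), lr=1e-2, weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(x) - targets) ** 2).mean()
        loss.backward()
        opt.step()
    return net

# Members: weight decay curbs memorization of the small dataset.
members = [fit(make_net(), y, weight_decay=1e-3) for _ in range(4)]
with torch.no_grad():
    ensemble_pred = torch.stack([m(x) for m in members]).mean(0)

# Student distilled against the ensemble's predictions instead of raw labels.
student = fit(make_net(), ensemble_pred)
print(((student(x) - ensemble_pred) ** 2).mean().item())
```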

Gödel Test: GPT‑5 nears proofs on 3/5 new “easy” conjectures, refutes 1

A new Gödel Test evaluates if LLMs can prove fresh, simple conjectures (not contest retreads). On five combinatorial optimization tasks, GPT‑5 produced nearly correct proofs on three (after small fixes), found an alternative bound on one, and failed one requiring cross‑paper synthesis paper abstract.

Abstract screenshot

  • Protocol hides the exact claim; model sees a short task description and 1–2 related papers; human verification checks proofs paper abstract.
  • Highlights progress and limits: originality on some tasks, but convincing mistakes and prompt sensitivity remain paper abstract.

Robust‑kbench raises the bar for agent‑generated CUDA; up to 2.5× forward speedups

Sakana AI releases robust‑kbench and a verification pipeline for translating PyTorch ops into optimized CUDA kernels that hold up under randomized inputs—reporting up to 2.5× forward speedups on common ops, with smaller backward gains paper page.

Paper first page

  • Prior benches were easy to game (shape hard‑coding, skipped work); robust‑kbench randomizes, checks both passes, and profiles with PyTorch/NVIDIA tools paper page.
  • Soft verifiers reject compile/memory/numeric bugs, pushing valid forward proposals to ≈80% before runtime tests paper page.
  • Found kernels transfer to ResNet blocks and norms (LayerNorm/RMSNorm) while avoiding overfitting to single shapes paper page.
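
The verification idea generalizes: a candidate implementation must match the reference op on randomized shapes and inputs, in both the forward and backward pass, before any speedup counts. Below is a minimal PyTorch‑level sketch (not Sakana's harness), using a trivially re‑expressed RMSNorm as the "candidate".

```python
# Randomized-input check in the spirit of robust-kbench: the candidate must
# match the reference on random shapes, forward and backward, before timing.
# A real run would plug in a generated CUDA kernel as the candidate.
import torch

def rmsnorm_ref(x, w, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * w

def rmsnorm_candidate(x, w, eps=1e-6):
    denom = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return x / denom * w

def check(candidate, reference, trials=5):
    for _ in range(trials):
        rows = torch.randint(1, 8, (1,)).item()
        cols = torch.randint(8, 256, (1,)).item()
        x = torch.randn(rows, cols, dtype=torch.float64, requires_grad=True)
        w = torch.randn(cols, dtype=torch.float64, requires_grad=True)
        out_c, out_r = candidate(x, w), reference(x, w)
        assert torch.allclose(out_c, out_r, atol=1e-8), "forward mismatch"
        gc = torch.autograd.grad(out_c.sum(), (x, w), retain_graph=True)
        gr = torch.autograd.grad(out_r.sum(), (x, w))
        assert all(torch.allclose(a, b, atol=1e-8) for a, b in zip(gc, gr)), "backward mismatch"
    return True

print(check(rmsnorm_candidate, rmsnorm_ref))
```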

Soft‑embedding RAG hits 99.95% top‑1 and 2.62× faster via federated DP

webAI proposes freezing an SLM encoder and training only a tiny adapter to produce “soft embeddings,” plus a classifier‑as‑retriever head—achieving 99.95% top‑1 on a spam task and up to 2.62× federated speedups with differential privacy overview, results.

Architecture sketch

  • Frozen encoder + MIPS baseline scored 12.36% top‑1; adapter‑only 96.79%; classifier‑only 97.02%; adapter+classifier 99.95% results.
  • Federated updates train only the small adapter/head; 1.75× speedup on 2 machines and 2.62× on 3 when backprop must pass the large stack results.
  • DP via clipped gradients + Gaussian noise keeps accuracy ≈99% while protecting client data privacy details.
  • Blog and paper detail implementation and trade‑offs for edge and regulated settings blog post.
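
The privacy step is also easy to sketch: clip the update norm and add Gaussian noise before the small adapter/classifier gradients leave a client. Values are illustrative, and formal DP‑SGD clips per‑example gradients, whereas this shows a single clipped‑and‑noised update for brevity.

```python
# Sketch of the DP step: clip the gradient norm of the small trainable pieces
# and add Gaussian noise before the update is shared. Clip norm and noise scale
# are illustrative, not the paper's values.
import torch

def privatize(grads, clip_norm=1.0, noise_std=0.5):
    total = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = min(1.0, (clip_norm / (total + 1e-12)).item())
    return [g * scale + noise_std * clip_norm * torch.randn_like(g) for g in grads]

# Usage on one client: backprop through the tiny adapter/head, then privatize
# the gradients before they are aggregated by the server.
w = torch.randn(8, 8, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
w.grad, = privatize([w.grad])
print(w.grad.norm())
```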
