Wed, Sep 10, 2025

MBZUAI K2‑Think 32B – ~2,000 tok/s; parity claim vs larger models

Executive Summary

MBZUAI’s K2‑Think 32B steps into the open with speed and swagger: ~2,000 tok/s per request on Cerebras inference and claimed parity with much larger models in math, code, and science. At the same time, Hugging Face pull requests tee up Alibaba’s Qwen3‑Next 80B MoE (~3B active parameters) and Qwen3‑VL across 4B dense and 30B‑A3B variants. Access widens elsewhere too: Duck.ai now lists GPT‑OSS 120B for anyone, and a 230M‑param nanoVLM tops SmolVLM2‑250M while running under 1.5GB RAM.

In numbers:

  • 32B parameters; ~2,000 tok/s per request on Cerebras inference
  • Qwen3‑Next: 80B total; ~3B active parameters; transformers PR signals day‑one support
  • Qwen3‑VL: 4B dense; 30B‑A3B MoE; MRope and DeepStacked ViT upgrades
  • Duck.ai model picker: GPT‑OSS 120B available free; no signup required
  • nanoVLM‑230M averages 53.2 vs SmolVLM2‑250M at 48.1 across listed benchmarks
  • Inference under 1.5GB RAM at 2048×2048 and 300 tokens

Also:

  • Moonshot checkpoint‑engine updates 1T parameters across thousands of GPUs in ~20s
  • SimpleQA Verified trims to 1,000 prompts; Gemini 2.5 Pro scores 54.5% F1

🤖 Embodied AI and Robot Control

A few embodied items: MCP control of ROS robots via LLMs, progress on dexterous hands, TI’s Spot data patrols, and Optimus clips. Mostly demos and enabling control stacks; the heavy ROS MCP coverage lives in the MCP section.

Open ROS MCP Server lets Claude, ChatGPT and Gemini operate ROS robots

An open-source ROS MCP Server bridges the Model Context Protocol to ROS1/2 so LLMs can list topics, services, and actions, read sensors, and issue commands without changing robot code. It’s cross‑OS (Linux/Windows/macOS), bidirectional via rosbridge, and demoed on a Unitree Go and in industrial debugging Release notes, Gemini demo, Getting started, Docs.

LLM↔ROS bridge diagram
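The bridge bottoms out in rosbridge’s JSON protocol over WebSocket. A minimal sketch of the frames a command‑velocity tool might emit; the topic, message type, and values are hypothetical, and real tools would also handle subscriptions and service calls:

```python
import json

def rosbridge_messages(topic, msg_type, msg):
    """Build the rosbridge-protocol JSON frames (sent over a WebSocket)
    that advertise a topic and then publish one message to it."""
    return [
        json.dumps({"op": "advertise", "topic": topic, "type": msg_type}),
        json.dumps({"op": "publish", "topic": topic, "msg": msg}),
    ]

# Hypothetical velocity command an LLM tool call might translate into:
frames = rosbridge_messages(
    "/cmd_vel", "geometry_msgs/Twist",
    {"linear": {"x": 0.2, "y": 0.0, "z": 0.0},
     "angular": {"x": 0.0, "y": 0.0, "z": 0.1}},
)
```

Because the frames are plain JSON, the MCP server never needs robot‑side code changes; it only needs a rosbridge endpoint.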

Yuansheng Intelligent shows 21‑DOF humanoid hands with tactile skin and 30 kg payload

New humanoid hands from Shenzhen Yuansheng Intelligent feature tendon actuation, 21 degrees of freedom, full tactile skin, 30 kg load per hand (2.3 kg per finger), and support for 33 grasp types—aimed at dexterous, force‑aware manipulation on bipedal platforms Specs post.

Texas Instruments’ Spot robot patrols RFAB to capture production data

TI is using Boston Dynamics’ Spot at its Richardson, Texas RFAB facility to run routine patrols and collect floor data that helps engineers identify trends and process issues over time. The deployment emphasizes autonomous inspection loops for manufacturing intelligence and uptime TI update.


🛡️ Safety, Security and Licensing

Cross‑lab alignment evals, AI in ransomware stat from MIT Sloan/Safe Security, Anthropic $1.5B settlement pushback, and Really Simple Licensing (RSL) for crawler pay‑to‑use. Mostly governance/licensing and threat reports.

MIT Sloan/Safe Security: AI used in 80% of 2,800 ransomware cases; 3‑pillar defense model

A study of 2,800 ransomware incidents finds AI in 80% of attacks—deepfakes, phishing, malware generation, AI‑driven password cracking—recommending automated hygiene (zero‑trust, self‑healing), autonomous/deceptive defenses, and augmented executive oversight with real‑time intel Study. This adds case counts and specifics in context of MIT Sloan 80% ransomware, which highlighted the topline rate and defense pillars.

ChatGPT Developer Mode enables write‑tool MCP connectors, with strong safety warnings

OpenAI rolled out a Developer Mode that adds full MCP client support for read/write tools—e.g., updating Jira or triggering automations—substantially expanding ChatGPT’s reach into live systems Dev mode. The settings panel flags risks: unverified connectors can modify or erase data; users should guard against prompt injections and malicious servers before enabling Risk warning, How‑to.


🎙️ Voice and Real‑Time Experiences

Notable voice updates: ElevenLabs Voice Remixing alpha, Microsoft Copilot voice/avatars and private chats, OpenAI keeps Standard Voice Mode, Kyutai DSM streaming seq2seq, Gemma 3n on‑device STT/translate. Mostly product capabilities with latency focus.

Microsoft puts Voice Mode and avatars on Copilot home screen, adds Private Conversations

Microsoft is bringing Copilot Voice Mode directly to the Copilot home screen, enabling immediate voice chats with Copilot Appearance (avatar), and rolling out Private Conversation support for sensitive sessions Feature drop. Details indicate a dedicated entry point and privacy controls, with a longer write‑up linked in the full scoop thread More info; the update is being amplified across product‑watcher feeds Signal.

ElevenLabs launches Voice Remixing alpha for controllable agent voices

ElevenLabs introduced Voice Remixing in alpha, letting teams transform existing cloned or designed voices by prompt to change gender, age, and accent for more local, conversational agents Launch. It supports any IVC/PVC clone or Voice Designed asset via the Voice Library, with examples like shifting a British studio voice to a headset‑style American accent for US launches Use case, and a preview UI showing promptable transformations (e.g., “Make this voice elderly and tired”) UI.

Voice Remixing alpha UI

Gemma 3n Edge app ships on‑device STT and speech‑to‑translated‑text

Google’s Gemma 3n Edge app now performs on‑device speech‑to‑text across multiple languages and speech‑to‑translated‑text, with batch processing of audio clips up to 30s; streaming and an iOS app are slated next App launch. The Edge Gallery listing highlights local inference for speech, text and images, while docs detail the new audio features and pipelines for offline use Feature notes, Repo+Play link.

Kyutai DSM unlocks low‑latency streaming ASR↔TTS with open code

Kyutai presented DSM, a streaming sequence‑to‑sequence approach that handles ASR to TTS (and back) with state‑of‑the‑art latency in the few‑hundred‑millisecond range while remaining competitive with offline baselines Overview. A decoder‑only LM plus pre‑aligned delayed streams supports infinite sequences, batching, and stable throughput across delays; paper and repo are publicly available Paper+repo.

Throughput vs delay
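The delayed‑streams idea can be sketched in a few lines: shift the output stream right by a fixed delay so each joint decoding step conditions on input context already emitted. The padding symbol and delay value below are illustrative, not the model’s actual tokens:

```python
PAD = "_"  # placeholder; the real model uses learned padding tokens

def interleave_delayed(text, audio, delay=2):
    """Toy version of pre-aligned delayed streams: the output stream
    lags the input by `delay` steps, so at every joint decoding step
    its prediction can see input tokens already produced."""
    t = list(text) + [PAD] * delay   # input stream, padded at the end
    a = [PAD] * delay + list(audio)  # output stream, shifted right
    return list(zip(t, a))           # one (input, output) pair per step
```

Because both streams advance one token per step, the same decoder‑only LM handles infinite sequences and batching uniformly across delays.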


🧭 RAG, Retrieval and Document Pipelines

Useful RAG updates: LlamaParse gains PPTX speaker notes, Chroma Package Search for dependency code, classroom RAG study comparing vector vs GraphRAG, and a prompt‑based ‘semantic firewall’ to reduce RAG hallucinations.

Comparative classroom RAG: vector search vs GraphRAG with a router

A new study builds EduScopeQA (3,176 questions across history, literature, science, and CS) and finds vector search best for short factual queries, GraphRAG‑Global for broad themes, and GraphRAG‑Local for long, detailed texts; a router picks per‑query to balance cost/quality Paper recap. Altered‑textbook tests show systems must resist stale parametric knowledge, favoring retrieval‑grounded answers Paper recap.

EduScopeQA title page
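A toy router in the spirit of the study’s findings; the keyword cues and length threshold are invented for illustration, not taken from the paper:

```python
def route_query(query, doc_tokens):
    """Pick a retrieval backend per query, following the pattern the
    study reports: broad/thematic -> GraphRAG-Global, long detailed
    sources -> GraphRAG-Local, short factuals -> vector search."""
    broad_cues = ("overall", "themes", "summarize", "compare")
    if any(cue in query.lower() for cue in broad_cues):
        return "graphrag-global"   # broad, thematic questions
    if doc_tokens > 50_000:        # illustrative length threshold
        return "graphrag-local"    # long, detailed source texts
    return "vector"                # short factual lookups
```

A production router would likely use a small classifier rather than keywords, but the cost/quality trade is the same: reserve graph traversal for queries that need it.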

Claude API adds native web fetch for agentic retrieval

Anthropic’s API now includes a web_fetch tool (public beta) so Claude can fetch/analyze URLs directly—handy for agents that cite docs, specs, and papers without standing up scrapers API announcement, Code example. Use allowlists and scoped domains to mitigate prompt‑injection/data‑exfil risks in link‑driven RAG Use cases, Safety tips.
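A hedged sketch of the request‑body fragment that enables the tool with an allowlist; the type string and parameter names follow the beta announcement but should be verified against current Anthropic docs:

```python
def web_fetch_tool(allowed_domains, max_uses=5):
    """Tool entry for an Anthropic Messages API request enabling
    web_fetch with a domain allowlist (assumed field names)."""
    return {
        "type": "web_fetch_20250910",              # beta tool version string
        "name": "web_fetch",
        "allowed_domains": list(allowed_domains),  # allowlist limits injection/exfil
        "max_uses": max_uses,                      # cap fetches per request
    }

# Scope fetches to the doc sources an agent actually needs:
tools = [web_fetch_tool(["docs.anthropic.com", "arxiv.org"])]
```

Keeping the allowlist tight is the main mitigation: a prompt‑injected link to an attacker domain simply fails the domain check.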

LlamaParse adds PowerPoint speaker notes parsing

LlamaIndex shipped speaker‑notes extraction for .pptx so ingest can capture slide context as structured metadata across all LlamaCloud modes LlamaIndex demo, Release note. That boosts slide‑based RAG (briefs, training decks) where key details live off‑slide. In context of LlamaCloud parsing modes, which added parsing options for visually complex docs, this extends coverage to presenter notes for better retrieval grounding.

WFGY ships prompt‑only semantic firewall for RAG stability

WFGY inserts pre‑answer state checks—ΔS (semantic drift), λ (convergence), and coverage—so models pause/reset when retrieved chunks don’t align, rather than post‑hoc patching Project write‑up. Authors report 90–95% stability on covered cases vs 70–85% with patch stacks; example shows blocking warranty hallucinations from refund text via drift/divergence triggers Method card, Bench snapshot.
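The gate pattern is easy to emulate: score drift before answering and reset instead of generating when retrieval doesn’t support the query. A toy version, with a threshold and scoring that are illustrative stand‑ins rather than WFGY’s actual ΔS/λ formulas:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pre_answer_gate(query_emb, chunk_embs, max_drift=0.6):
    """Pause/reset before answering if even the best retrieved chunk
    drifts too far from the query (drift = 1 - cosine similarity)."""
    drift = min(1.0 - cosine(query_emb, c) for c in chunk_embs)
    return ("answer", drift) if drift <= max_drift else ("reset", drift)
```

The point is ordering: the check runs before generation, so a refund‑policy chunk never gets the chance to morph into a warranty hallucination downstream.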

Chroma launches Package Search for dependency code retrieval

Chroma introduced Package Search so agents can traverse and query dependency source code with turnkey flows (indexing + retrieval) rather than ad‑hoc scraping Chroma launch. For AI engineers wiring repo‑scale RAG, this reduces glue code and improves precision on 3rd‑party APIs and libs referenced in local projects.

V4 multimodal embeddings land in llama.cpp with near PyTorch parity

JinaAI patched llama.cpp (attention mask + ViT projection to match conv3d path) to generate V4 multimodal embeddings from GGUF, showing near‑identical averages to the PyTorch reference on ViDoRe/MTEB tasks—even for quantized variants Tech note, Results table. This unlocks local multimodal retrieval for image‑rich RAG pipelines on lightweight runtimes Repo/blog.


🧮 Chips and Accelerators

Hardware cadence continues: NVIDIA Blackwell Ultra MLPerf inference records, Rubin CPX 128GB GDDR7 context GPU for long windows, Broadcom $10B ASIC orders incl. OpenAI, DCQCN Test‑of‑Time award context for RDMA at scale.

NVIDIA Blackwell Ultra sets new MLPerf inference records

Blackwell Ultra posted 5,842 tok/s/GPU offline and 2,907 tok/s/GPU server on DeepSeek‑R1—4.7× and 5.2× faster than Hopper; it also set 138 tok/s/GPU on Llama 3.1 405B interactive. Gains come from NVFP4 math, 2× attention compute, larger HBM3e, and dense NVLink fabrics, plus FP8 KV‑cache and expert/data parallelism with ADP Balance MLPerf blog, Perf table, Per‑accelerator table.

NVIDIA Rubin CPX debuts as a dedicated context‑phase GPU with 128GB GDDR7

Rubin CPX targets prefill/context with 30 PFLOPs NVFP4, 128GB GDDR7 and 3× attention accel vs GB300 NVL72; a Vera Rubin NVL144 CPX rack pairs 144 CPX + 144 Rubin GPUs + 36 CPUs for ~8 EFLOPs NVFP4, ~100TB fast memory and 1.7 PB/s bandwidth. Dynamo orchestrates KV‑cache handoff; dedicated video codecs enable ~1M‑token video contexts without cross‑chip shuttling Rubin CPX specs, Video pipeline.

Rubin CPX 128GB GDDR7

Broadcom secures ~$10B AI ASIC orders with OpenAI mass production by 2026

Reports indicate Broadcom won ~$10B in AI ASIC orders, with OpenAI slated for mass production by 2026; ByteDance, Apple and xAI are also in line for 2026–2027, positioning Broadcom as a serious alternative to general‑purpose GPUs for hyperscale inference/training economics Report summary, Follow‑on note.

DCQCN wins SIGCOMM Test of Time for RDMA at AI scale

DCQCN (Data Center Quantized Congestion Notification) earned SIGCOMM’s Test of Time; designed at MSR, it underpins large‑scale RDMA deployments used in modern LLM training/serving (e.g., GPT‑4.5 era systems). The backstory underscores production‑first validation before publication, explaining its decade‑long durability Award note.


🧪 Training, RL and Reasoning Methods

Wave of training methods beyond plain SFT: DCPO for RLHF stability/exploration, Parallel‑R1 for mid‑training parallel thinking, Meta’s Language Self‑Play data‑free RL, TraceRL for diffusion LMs. Also GRPO over DSPy programs.

Parallel‑R1 trains parallel thinking via a mid‑training exploration scaffold

A simple supervised warm‑up teaches the <Parallel><Path> format; then RL on easy math rewards correct answers that include parallel blocks; finally accuracy‑only RL on harder sets lets models trigger parallelism only when useful. Early training encourages exploration; later stages use parallelism mainly to verify a candidate solution, improving final accuracy without bloating traces Tech report. Authors release data/code pointers for reproduction Tech report.
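The staging can be sketched as a reward function; the tag names follow the paper’s format, while the stage names and reward values are illustrative assumptions:

```python
import re

def staged_reward(answer, correct, stage):
    """Reward shaping across the three phases described above:
    warm-up SFT (no RL reward), exploration RL that requires parallel
    blocks, and a final accuracy-only RL phase."""
    has_parallel = bool(re.search(r"<Parallel>.*?</Parallel>", answer, re.S))
    if stage == "warmup":               # supervised format teaching, no reward
        return 0.0
    if stage == "explore":              # easy math: correct AND parallel blocks
        return 1.0 if (correct and has_parallel) else 0.0
    return 1.0 if correct else 0.0      # hard sets: accuracy only
```

Dropping the format bonus in the last stage is what lets the model invoke parallelism only when it actually helps, instead of bloating every trace.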

TraceRL rewards real decoding steps; TraDo‑4B/8B beat strong 7B/8B baselines

New results show +6.1% and up to +51.3% gains on math/coding by rewarding step‑wise diffusion traces (serving‑aligned) and adding a per‑step value model for stability Paper. Works for full and block attention; enables larger block sizes (4→8) without loss and supports very long chains. Toolkit ships training recipes and evals for reproduction Paper. In context of TraceRL, which first introduced trajectory‑aware RL for diffusion LMs, these numbers quantify the edge over 7B/8B baselines.

Baichuan’s DCPO lifts math accuracy by stabilizing RLHF and boosting exploration

DCPO introduces dynamic adaptive clipping (fewer over‑clipped rare tokens) and smoothed advantage standardization (steady gradients under flat rewards), beating GRPO/DAPO across Qwen2.5 1.5B→14B on AIME/MATH, with higher Avg@32 (robustness), lower token clipping ratio (TCR) and better response utilization (RUR) DCPO explainer. Plots show sustained Avg@1/Avg@32 gains and lower TCR over 400 steps across sizes DCPO explainer.
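The core idea of dynamic adaptive clipping can be illustrated by widening a PPO‑style clip range for rare tokens; the widening rule below is an invented stand‑in, not DCPO’s published formula:

```python
def dynamic_clip_bounds(p_old, eps=0.2, alpha=0.1):
    """Widen the clip interval for rare tokens (low old probability)
    so their gradient updates are not over-clipped; widening rule is
    illustrative."""
    widen = alpha * (1.0 - p_old)   # rarer token -> wider interval
    return 1.0 - (eps + widen), 1.0 + (eps + widen)

def clipped_ratio(ratio, p_old):
    """Clamp the policy ratio into the token's dynamic interval."""
    lo, hi = dynamic_clip_bounds(p_old)
    return min(max(ratio, lo), hi)
```

With a fixed interval, rare tokens hit the clip boundary constantly (high token clipping ratio); letting the bounds depend on the old probability is what the lower TCR numbers are measuring.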

ByteDance’s AgentGym‑RL unifies multi‑turn agent training without SFT

AgentGym‑RL offers a modular RL stack spanning web, search, games, and embodied/science tasks; agents rival or surpass commercial systems on 27 tasks, showing the benefit of closed‑loop optimization over SFT‑only pipelines Announcement. Project page, repo, and abstract detail the suite and extensibility for new environments and evaluators Links.

Running GRPO on DSPy programs is now a practical path to RL over agents

DSPy maintainers highlight that combining declarative Signatures with GRPO lets teams optimize prompts, weights, and inference compute—beyond “prompt engineering”—if GPUs are available How‑to, Reminder. The team notes DSPy has supported offline RL since 2023 and GRPO since 2025; declarative abstractions make swapping optimizers and adding RL scaffolds straightforward Context.

RL beats SFT for open‑web deep research: data curation, rewards, coordination surveyed

First dedicated survey systematizes RL for research agents: multi‑hop data synthesis, outcome/format rewards with credit assignment, masking tool text, when to search vs recall, multimodal extensions, and RL training systems (Agent Lightning, AREAL, SLIME, etc.) Survey, Structure. Benchmarks span long‑form QA and domain tasks; takeaway—RL improves robustness and recovery compared to SFT/preference‑only setups Benchmarks.

Reverse‑Engineered Reasoning recovers plans from good outputs to train writers

REER scores candidate plans by surprise on a reference answer and locally refines segments, creating DeepWriting‑20K (20k traces, 25 categories). Fine‑tuning Qwen3‑8B on these plans yields DeepWriter‑8B, which beats open baselines and approaches proprietary systems on creative writing (LongBench‑Write) without RL or distillation Paper.


⚙️ Serving, Weight Updates and Runtime

Infra advances dominated: Moonshot’s checkpoint‑engine (near‑sync RL weight updates), vLLM determinism and collab on checkpoint‑engine, NVIDIA+Google Dynamo prefill/decode disaggregation on GKE, llama.cpp multimodal embeddings parity.

Moonshot open-sources checkpoint‑engine for near‑sync RL weight updates at trillion‑parameter scale

Moonshot released checkpoint‑engine, enabling in‑place weight updates for Kimi K2‑scale models (1T params) across thousands of GPUs in about 20s, with both broadcast (sync) and P2P (elastic) modes plus overlapped H2D/copy pipelines Moonshot announce. vLLM says it’s collaborating: ~20s 1T broadcasts on 1000s of GPUs and dynamic P2P for elastic clusters vLLM collab. A deep dive details ~60× step‑wise weight‑transfer optimizations vLLM deep dive. In context of slime 64→64 in ~8s, this pushes near‑real‑time RL updates into mainstream serving.

Checkpoint flow
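The overlapped H2D/copy pipeline amounts to double buffering: stage the next weight bucket while broadcasting the current one. A toy sequential simulation; real implementations run the two steps on separate CUDA streams, and the function names here are illustrative:

```python
def update_weights(buckets, stage, broadcast):
    """Double-buffered weight update: while bucket i is broadcast to
    the cluster, bucket i+1 is staged (host-to-device copy), so the
    two operations overlap instead of serializing."""
    staged = stage(buckets[0])          # prime the pipeline
    log = []
    for nxt in buckets[1:]:
        log.append(broadcast(staged))   # broadcast current bucket...
        staged = stage(nxt)             # ...while staging the next one
    log.append(broadcast(staged))       # drain the last bucket
    return log
```

Chunking a 1T‑parameter checkpoint into buckets is also what makes the elastic P2P mode possible: late‑joining ranks can fetch buckets individually rather than one monolithic blob.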

Dynamo recipe splits prefill and decode pools on GKE with H200s and vLLM for cheaper, faster serving

A joint NVIDIA+Google recipe disaggregates LLM serving: compute‑bound prefill and bandwidth‑bound decode run on separate GPU pools, with NVIDIA Dynamo routing and transferring KV cache between them Recipe overview. It runs on GKE A3 Ultra (H200) and vLLM (PagedAttention), improving throughput/latency and scaling each phase independently under mixed prompt lengths GKE details.

llama.cpp gains multimodal V4 embeddings with GGUF after vision and attention fixes

JinaAI fixed llama.cpp’s attention mask over image tokens and matched PyTorch’s conv3d vision projection (via flattened conv3d weights), enabling V4 multimodal embeddings in GGUF Jina explain. On ViDoRe/MTEB tasks, F16 and quantized GGUF variants match or slightly beat the PyTorch reference on average (≈84.2 vs 84.17), making local multimodal embedding serving production‑viable Results.

Causal vs non‑causal

vLLM details deterministic inference and releases a live internals notebook

A new tutorial shows how to achieve deterministic outputs with vLLM (seeds, scheduler choices, CUDA reproducibility, and I/O ordering caveats) vLLM determinism. A companion live notebook explores vLLM internals to help engineers reason about execution and reproducibility under real workloads vLLM internals. Together they reduce “heisenbugs” in prod serving by standardizing config and traceability.


📈 Evals, Leaderboards and Observability

Lots of eval work: SimpleQA Verified (1k prompts), creative writing V3 leaderboard (Kimi K2‑0905 leads), OpenAI Evals adds audio graders, cross‑lab safety pilot notes, Physics‑IQ for video physics understanding. Mostly eval releases and methodology debates.

OpenAI Evals adds native audio inputs and graders

OpenAI shipped native audio inputs and audio graders to Evals, enabling direct assessment of model speech responses without transcription; a Cookbook guide walks through setup and scoring workflows OpenAI Evals audio, Teaser. This brings eval coverage to voice agents and call‑center use cases with standardized rubrics and pass/fail thresholds OpenAI Evals audio.

SimpleQA Verified leaderboard crowns Gemini 2.5 Pro

DeepMind’s SimpleQA Verified trims to 1,000 clean factuality prompts with reproducible grading; the leaderboard shows Gemini 2.5 Pro at 54.5% ±3.2% F1, ahead of o3 (52.3%), Grok‑4 (51.9%) and GPT‑5 (51.6%) Leaderboard, Paper+LB. The set rebalances topics, diversifies answer types, and avoids external search to measure parametric knowledge Leaderboard.
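SimpleQA‑style F1 is commonly described as the harmonic mean of overall accuracy and precision on attempted questions, which rewards calibrated abstention over confident guessing; a sketch under that reading (verify the exact definition against the paper):

```python
def simpleqa_f1(correct, attempted, total):
    """Harmonic mean of overall accuracy (correct/total) and precision
    on attempted questions (correct/attempted); abstentions reduce
    'attempted' without counting as wrong."""
    recall = correct / total
    precision = correct / attempted if attempted else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A model that answers half the set perfectly and abstains on the rest scores 2/3 under this metric, better than one that guesses everything at 50% precision.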

Physics‑IQ finds video models weak on real‑world physics

On Physics‑IQ, the best model (VideoPoet multiframe) scores 29.5%, far below real‑video agreement caps; multiframe variants generally beat i2v Scores, Paper. Visual realism (e.g., Sora) is uncorrelated with physics; tasks span solids, fluids, optics, thermodynamics, magnetism with motion‑based grading of predicted futures Paper, Scores.

Anthropic ships Claude Code Analytics API for org metrics

New Analytics API exposes daily org‑level metrics for Claude Code, bridging the gap between the console dashboard and full OpenTelemetry integration; available to teams today with docs for endpoints and fields Launch, Why it exists, Docs link. Useful for tracking seat adoption, run counts, and success rates across engineering groups Launch.

Kimi K2‑0905 leads Creative Short‑Story Writing V3

V3 ups difficulty to 600–800 words, 18‑question rubric, power‑mean scoring, and 400 stories/LLM with seven grader models Benchmark V3. Kimi K2‑0905 posts 8.72 mean, edging GPT‑5 (8.71) and Qwen3 Max Preview (8.69); Mistral Medium 3.1 scores 8.63 Benchmark V3, Diversity views. Stronger selection and refreshed graders target style, coherence, and story requirements Benchmark V3.
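Power‑mean aggregation with an exponent below 1 penalizes low outlier grades more than a plain average, so one harsh grader drags the score down; a sketch (the exponent here is an assumption, not the benchmark’s published value):

```python
def power_mean(scores, p=0.5):
    """Generalized (power) mean M_p = (mean(x_i^p))^(1/p).
    For p < 1 this sits below the arithmetic mean whenever scores
    disagree, punishing stories with any weak grader verdict."""
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)
```

For example, grades of 1 and 9 average to 5 arithmetically but only 4 under p = 0.5, which is why power‑mean leaderboards favor consistently good stories over spiky ones.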

Cross‑lab pilot compares jailbreak and alignment robustness

Pilot alignment tests across public models show: strong system‑prompt protection for Claude; o3/o4‑mini more jailbreak‑robust; refusal vs hallucination tradeoffs; lower scheming in o3/Sonnet 4; and grader quirks matter Findings. Both labs call for stronger cross‑lab scaffolds and harder settings going forward Cross‑lab note.

Hugging Face tracker now logs images, videos and tables

Hugging Face’s free experiment tracking library adds native logging of images, videos, tables and metrics—handy for eval artifacts, qualitative error slices and model debugging dashboards HF tracker, Echo. This tightens feedback loops for perception, speech and agent benchmarks that need rich media evidence, not just scalar scores HF tracker.

ROMA tops SEAL‑0, beating closed deep‑research systems

Recursive Open Meta‑Agent (ROMA) posts 45.6% accuracy on the real‑time SEAL‑0 benchmark, surpassing Kimi‑Researcher (36.0%), Perplexity Deep Research (31.5%), and others SEAL‑0 chart, Table. ROMA uses hierarchical/parallel task decomposition with transparent traces, aiding eval reproducibility and error analysis SEAL‑0 chart.


🎬 Generative Media: Image, Video, Audio

Packed media cycle: Seedream 4 multi‑image edit/create (4K, cheap), Veo 3 pricing cuts + 9:16, Lucy‑14B fastest i2v via fal, Ideogram V3 presets, Stable Audio 2.5 on Replicate/ComfyUI, Leonardo’s Lucid Origin. Mostly feature/pricing and hands‑on demos.

Veo 3 adds native 9:16 vertical and new 4/6/8s clip lengths amid pricing cuts

Google’s Veo 3 and Veo 3 Fast now natively output 9:16 vertical, and offer 4/6/8s durations for finer control—useful for Shorts/Reels without post-cropping FAL update, FAL durations. This lands as Veo 3 pricing was halved and Flow added multi‑aspect export, in context of Veo 3 price cut+vertical. Google highlights vertical support via Flow as well DeepMind Flow. Builders cite lower costs: $0.40/s Veo 3, $0.15/s Veo 3 Fast Pricing.

Lucy‑14B image‑to‑video launches with sub‑8‑second generations on FAL

FAL and Decart debuted Lucy‑14B, a photorealistic image‑to‑video model positioned as the fastest large i2v yet. It launched day‑0 on FAL with ~10‑second generations, since boosted to sub‑8s FAL launch, Speed boost. Decart touts no quality compromise at speed Decart pitch, and FAL invites hands‑on trials Try Lucy, Quality claim.

Seedream 4 lands in ComfyUI and Replicate with 4K, multi‑image edit and up to nine outputs

ByteDance’s Seedream 4 rolls into ComfyUI with 4K generation, natural‑language control, up to 9 outputs and 6 inputs per run, and stronger cross‑image consistency ComfyUI release, Sequence outputs. Replicate hosting and community comparisons continue Replicate add, Arena add. Comes after the ~$0.03/gen debut and multi‑view/edit feature set Seedream 4 $0.03 + 4K.

Stable Audio 2.5 arrives on Replicate and ComfyUI with fast 3‑minute generation and inpainting

Stability AI’s Stable Audio 2.5 is live on Replicate (API/web) with up to 3‑minute tracks generated in seconds and commercial licensing Replicate. ComfyUI added a Stable Audio API node supporting multi‑part compositions, inpainting/extensions, and enterprise‑grade workflows ComfyUI release, Blog. FAL also surfaced a hosted entry point FAL host.

Ideogram V3 style presets go live on FAL and Replicate for one‑click aesthetics

Ideogram V3’s 69 artistic style presets are now available across FAL endpoints, bringing quick, repeatable looks for image generation and design workflows FAL launch, Endpoints. Replicate added matching style‑preset support for V3 models, broadening API access for developers Replicate add, More details.

Tencent tees up HunyuanImage 2.1 ComfyUI integration and lighter VRAM quantized release

Following the open 2K‑native drop, Tencent flagged ongoing ComfyUI integration work and an accelerated push for a quantized variant with lower VRAM needs; they recommend generating at native 2K for best fidelity Integration update. Official demos and repos remain available for full‑quality runs Integration update.

Leonardo’s Lucid Origin hits Replicate, emphasizing prompt adherence and clean text

Leonardo’s new Lucid Origin model is now on Replicate, marketed for strong prompt adherence and notably clean text rendering in HD outputs—handy for posters, signage, and graphic compositions Launch note, Visual examples.


🏗️ Cloud, Capacity and Economics

Big infra and economics: WSJ reports $300B OpenAI–Oracle deal, Oracle RPO/capex surge, Google Cloud $106B commitments, Perplexity $200M at $20B. Signals robust inference demand and supply races; some pricing promos (Claude Pro).

OpenAI inks ~$300B Oracle compute pact; Oracle RPO swells, stock spikes ~43%

WSJ says OpenAI signed a ~$300B, ~5‑year Oracle cloud deal requiring ~4.5 GW of power, among the largest ever WSJ scoop. Oracle touted a $455B contracted backlog with a pipeline to >$500B and projects $114B OCI revenue by FY’29, up from $10B in FY’25 Backlog/guide. Shares jumped ~43% this week toward a ~$1T cap as AI capacity wins re-rate the stock Stock move, 1W chart. Oracle plans ~$35B FY capex to add regions and GPUs, leaning into inference‑driven demand Backlog/guide.

Perplexity raises $200M at ~$20B valuation; ARR near $200M

Perplexity closed a $200M round at a ~$20B valuation, just weeks after a $100M raise; total funding nears ~$1.5B TechCrunch. Reported ARR is approaching ~$200M (up from ~$150M last month), implying ~100× ARR multiple as the company scales compute‑heavy answer generation (retrieval + LLM inference) TechCrunch.

Broadcom lands ~$10B in AI ASIC orders; OpenAI mass production by 2026

Broadcom secured roughly $10B in AI ASIC orders with OpenAI cited as a key customer targeting mass production by 2026; other wins include ByteDance, Apple and xAI for 2026–27 ReportFollow‑on. The surge underscores rising demand for custom silicon as hyperscalers hedge Nvidia and chase lower $/token economics at scale Report.

NVIDIA unveils giga‑scale AI factory reference design for GW‑class sites

NVIDIA announced a partner reference design for giga‑scale “AI factories,” integrating building, power, cooling and compute with digital‑twin simulation (Omniverse/OpenUSD) to maximize watts to training/inference AI factory blueprint. A 2026 blueprint with shared APIs/assets targets coordinated design and ops, reflecting the sector’s shift to grid‑sized, tightly‑coupled AI capacity 2026 plan.

Claude Pro promo: 50% off for 3 months for new signups spotted

A Claude Pro promotion offering 50% off for 3 months (e.g., €9 vs €18/month shown) was surfaced via claude.ai/purpose, likely targeting new signups Promo sighting, Follow‑up. Discounting from leading model vendors signals competitive customer acquisition amid rising usage and growing enterprise tiers.


🚀 New and Upcoming Models

Steady drumbeat of releases/teases: K2‑Think 32B reasoning, Qwen3‑Next and Qwen3‑VL pull requests, GPT‑OSS 120B in Duck.ai, nanoVLM‑230M beating SmolVLM2, EmbeddingGemma updates. Mostly model drops and ports; media models covered separately.

K2‑Think 32B hits ~2,000 tokens/sec per request

MBZUAI’s open 32B reasoning model K2‑Think is now cited at ~2,000 tok/s per request on Cerebras inference, while claiming parity vs much larger models in math/code/science Announcement. This builds on its earlier reveal of frontier‑level scores and compact design in context of K2‑Think 32B launch. Official site and HF links are live for code, weights and docs Announcement.

nanoVLM‑230M outperforms peer 250M models with tiny memory

A new tiny VLM (230M) posts higher averages than SmolVLM2‑250M across DocVQA/MME/AI2D/ChartQA/TextVQA (e.g., 53.2 vs 48.1 avg) while training on 1/4 compute Bench table. Inference runs with under 1.5GB RAM at 2048×2048 and 300 tokens, in pure PyTorch for easy tinkering Runtime; demo and scripts are provided Demo.

Transformers PR lands Qwen3‑VL 4B and 30B‑A3B (Instruct/Thinking)

A transformers PR introduces support for Qwen3‑VL, covering a 4B dense and 30B‑A3B MoE family, plus Instruct and Thinking variants aimed at stronger multimodal understanding HF PR. The configs mention MRope upgrades, DeepStacked ViT features, and text‑aligned temporal grounding for video tasks Config notes.

Hugging Face adds Qwen3‑Next support ahead of model drop

A transformers PR to add support for Qwen3‑Next is open, signaling an imminent release of Alibaba’s sparse MoE (80B with ~3B active) touted for high efficiency HF PR. The PR references the upcoming Qwen3 series repo and credits maintainers, suggesting first‑class integration on day one HF PR.

Duck.ai opens GPT‑OSS 120B to everyone, no login needed

Duck.ai’s model picker now lists GPT‑OSS 120B as an open‑weights option available free without signup, alongside other models Picker. This widens public access to a large open model for chat and testing and lowers the barrier for quick evals of 120B‑scale open inference Picker.

Duck.ai model picker

EmbeddingGemma tops Hugging Face trending; transformers version matters

Google’s EmbeddingGemma is the top‑trending model on Hugging Face, but users should upgrade transformers to the specified version to avoid incorrect outputs HF trending, Upgrade note. The attention underscores interest in retrieval/classification pipelines that depend on embedding parity across frameworks.


🧩 MCP and Interop

Major interoperability step: ChatGPT Developer Mode adds full MCP client (read/write). New servers and registry momentum, plus MCP for robotics and research context. Mostly MCP client rollout news and concrete connectors.

ChatGPT adds full MCP client with write‑actions via Developer Mode

OpenAI enabled a full Model Context Protocol client inside ChatGPT’s new Developer Mode, supporting read and write tools and custom connectors in chat. Unverified connectors are allowed with clear risk warnings (prompt injection, destructive writes), and setup lives under a new Dev Mode toggle OpenAI dev post, Feature UI. Early users confirm working custom MCP servers and ask about local vs remote support; some report remote‑only for now User note, Remote‑only Q. Screens and docs emphasize "use at your own risk" and detail where to enable it Settings, Overview. Community leaders and infra folks mark this as a major interop step Comment, Reaction.

Dev mode warning
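Under the hood, MCP clients speak JSON‑RPC 2.0; a sketch of the tools/call request a client like ChatGPT sends to invoke a server tool. The tool name and arguments are hypothetical, but they represent exactly the class of write action the Dev Mode warnings cover:

```python
import json

def mcp_tool_call(req_id, tool, arguments):
    """JSON-RPC 2.0 request an MCP client sends to invoke a
    (possibly write-capable) tool exposed by an MCP server."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Hypothetical write action against a Jira-style connector:
req = mcp_tool_call(1, "update_issue", {"key": "PROJ-123", "status": "Done"})
```

Because the arguments are assembled from conversation context, a prompt‑injected instruction can steer them, which is why unverified write‑capable connectors carry the strongest warnings.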

Open‑source ROS MCP Server bridges LLMs to real robots over ROS1/2

An open ROS MCP Server lets LLMs like Claude, GPT and Gemini read sensors and send commands through ROS topics, services and actions—without modifying robot code. It supports ROS1/2, Linux/Windows/macOS, and a bidirectional rosbridge for observability; demos include Unitree Go and industrial debugging Release thread, Demo. Quickstart diagrams show an MCP client→server→ROS pipeline usable from common IDEs/assistants Getting started, Follow‑up.

ROS MCP diagram

Getting started diagram

Qodo Aware MCP delivers deep codebase context to agents, with free endpoint

Developers showcase Qodo Aware as a production‑ready MCP server for deep research over large repos—feeding high‑signal context into agents and speeding PR reviews. Public demos plug it into Codex; the team offers a free Aware MCP endpoint and guidance for indexing private repos Demo video, Free endpoint. Reported wins include faster onboarding, better tests/docs, and improved code quality via richer retrieval Usage notes.


🛠️ Agent Engineering and Coding Tools

Heavy week on agentic coding: Replit Agent 3 autonomy demos, Claude Code Analytics API, Factory CLI + VS Code, Qodo Aware context MCP, planning prompts and workflows. Focus on long‑running autonomy and developer ergonomics.

Replit Agent 3 pushes long‑running autonomy to 200 minutes

Replit says Agent 3 is 10× more autonomous than v2, sustaining ~200‑minute coding runs with faster, cheaper scaffolding. Demos showed it building a real‑time collaborative drawing app (websockets, multi‑browser) in about an hour of agent work Replit post, Demo clip. Community reactions frame this as the current leader in the “how long without intervention?” race Take, with the time‑horizon shift highlighted by a growth chart Run lengths. Broader context notes that as runs lengthen, control loops and tooling outweigh base model choice Ops note, Comment.

Qodo Aware MCP delivers deep context retrieval for large codebases

Builders report strong results using Qodo Aware MCP to retrieve high‑quality, scoped context from large repos inside Codex, improving deep research and PR review Demo, How to try. The team offers a free MCP endpoint and self‑hosted indexing to unlock agentic workflows that depend on precise, cross‑file grounding Benefits.

Claude gains web fetch tool for in‑API browsing workflows

Anthropic’s API now includes a web fetch tool (public beta) so agents can retrieve and analyze arbitrary URLs without separate infra Launch, API example. Use cases: digging deeper on search hits, coding against API docs, extracting from spec PDFs, and customer‑link analysis Use cases, Getting started. Security notes urge domain allowlists and injection‑aware designs Safety notes.
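A hedged sketch of what a Messages API request with the fetch tool enabled could look like, assuming the public‑beta identifiers at launch (`web_fetch_20250910` tool type, `web-fetch-2025-09-10` beta header); the model name and domain below are placeholders:

```python
# Request body for POST /v1/messages with the web fetch tool enabled.
# Assumptions: tool type and beta header are the launch-era beta identifiers;
# model and allowed domain are illustrative placeholders.
payload = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "tools": [{
        "type": "web_fetch_20250910",
        "name": "web_fetch",
        "max_uses": 3,                            # cap fetches per request
        "allowed_domains": ["docs.example.com"],  # allowlist, per the safety notes
    }],
    "messages": [{
        "role": "user",
        "content": "Summarize https://docs.example.com/api/spec",
    }],
}
headers = {"anthropic-beta": "web-fetch-2025-09-10"}
# Send payload with these headers (plus your API key) to the Messages endpoint.
```

The domain allowlist and `max_uses` cap are the knobs the safety notes point at for injection‑aware designs.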

Engineers flag Claude Code friction: approvals, reminders, context use

Users highlight confusing approval prompts when sub‑agents return findings during plan/research phases UX gripe. A timeline suggests escalating “system reminder” spam that interrupts basic operations Escalation, Details; the concern is that reminder bloat shrinks usable context and accelerates fill‑up versus alternatives Context note. A long bug‑hunt prompt workflow is circulating to compensate, using structured, multi‑worktree tactics Workflow.

Plan‑first agent workflows spread: plan.md and an ‘oracle’ to lay rails

Practitioners report longer, steadier runs by front‑loading planning. A plan.md enumerating files, context loads, tests, and PR boundaries enables 30–40 minute autonomous stretches Plan.md. Amp’s ‘oracle’ sub‑agent generates implementation plans that keep execution on track Amp oracle. Engineers stress that extending runs shifts success from model choice to control loops and verification/backoff Long‑run ops, echoing an AGENTS.md primer that emphasizes repo‑level specs for agents.
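A minimal plan.md in this spirit might read (structure and contents are illustrative, not taken from the circulating workflows):

```markdown
# Plan: add rate limiting to /api/search

## Files
- src/middleware/rate_limit.ts (new)
- src/routes/search.ts (wire in middleware)

## Context to load
- src/middleware/auth.ts (existing middleware pattern to mirror)

## Tests
- tests/rate_limit.test.ts: expect 429 after N requests per window

## PR boundary
- Middleware plus wiring only; no config UI or docs changes
```

The point is that each section gives the agent a checkable rail: which files it may touch, what it must read first, what "done" means, and where the PR stops.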

Anthropic ships Claude Code Analytics API for org‑wide metrics

Anthropic released the Claude Code Analytics API to expose daily org usage without standing up full OpenTelemetry, bridging the gap between the Console dashboard and OTEL pipelines Launch, Why it exists. Docs and access are live for Claude Code teams Docs note. This targets engineering managers who need rollout tracking and adoption baselines before heavier observability stacks.

Factory CLI lands VS Code/Cursor extension with one‑click model picks

Factory’s AI CLI now integrates with VS Code/Cursor, prompting users to install an extension that lets them quickly swap among models (e.g., Opus 4.1, GPT‑5) while staying in terminal‑driven workflows Dev post. The update emphasizes smooth local loops and model agility for coding agents embedded in editors.

On this page

Executive Summary
🤖 Embodied AI and Robot Control
Open ROS MCP Server lets Claude, ChatGPT and Gemini operate ROS robots
Yuansheng Intelligent shows 21‑DOF humanoid hands with tactile skin and 30 kg payload
Texas Instruments’ Spot robot patrols RFAB to capture production data
🛡️ Safety, Security and Licensing
MIT Sloan/Safe Security: AI used in 80% of 2,800 ransomware cases; 3‑pillar defense model
ChatGPT Developer Mode enables write‑tool MCP connectors, with strong safety warnings
🎙️ Voice and Real‑Time Experiences
Microsoft puts Voice Mode and avatars on Copilot home screen, adds Private Conversations
ElevenLabs launches Voice Remixing alpha for controllable agent voices
Gemma 3n Edge app ships on‑device STT and speech‑to‑translated‑text
Kyutai DSM unlocks low‑latency streaming ASR↔TTS with open code
🧭 RAG, Retrieval and Document Pipelines
Comparative classroom RAG: vector search vs GraphRAG with a router
Claude API adds native web fetch for agentic retrieval
LlamaParse adds PowerPoint speaker notes parsing
WFGY ships prompt‑only semantic firewall for RAG stability
Chroma launches Package Search for dependency code retrieval
V4 multimodal embeddings land in llama.cpp with near PyTorch parity
🧮 Chips and Accelerators
NVIDIA Blackwell Ultra sets new MLPerf inference records
NVIDIA Rubin CPX debuts as a dedicated context‑phase GPU with 128GB GDDR7
Broadcom secures ~$10B AI ASIC orders with OpenAI mass production by 2026
DCQCN wins SIGCOMM Test of Time for RDMA at AI scale
🧪 Training, RL and Reasoning Methods
Parallel‑R1 trains parallel thinking via a mid‑training exploration scaffold
TraceRL rewards real decoding steps; TraDo‑4B/8B beat strong 7B/8B baselines
Baichuan’s DCPO lifts math accuracy by stabilizing RLHF and boosting exploration
ByteDance’s AgentGym‑RL unifies multi‑turn agent training without SFT
Running GRPO on DSPy programs is now a practical path to RL over agents
RL beats SFT for open‑web deep research: data curation, rewards, coordination surveyed
Reverse‑Engineered Reasoning recovers plans from good outputs to train writers
⚙️ Serving, Weight Updates and Runtime
Moonshot open-sources checkpoint‑engine for near‑sync RL weight updates at trillion‑parameter scale
Dynamo recipe splits prefill and decode pools on GKE with H200s and vLLM for cheaper, faster serving
llama.cpp gains multimodal V4 embeddings with GGUF after vision and attention fixes
vLLM details deterministic inference and releases a live internals notebook
📈 Evals, Leaderboards and Observability
OpenAI Evals adds native audio inputs and graders
SimpleQA Verified leaderboard crowns Gemini 2.5 Pro
Physics‑IQ finds video models weak on real‑world physics
Anthropic ships Claude Code Analytics API for org metrics
Kimi K2‑0905 leads Creative Short‑Story Writing V3
Cross‑lab pilot compares jailbreak and alignment robustness
Hugging Face tracker now logs images, videos and tables
ROMA tops SEAL‑0, beating closed deep‑research systems
🎬 Generative Media: Image, Video, Audio
Veo 3 adds native 9:16 vertical and new 4/6/8s clip lengths amid pricing cuts
Lucy‑14B image‑to‑video launches with sub‑8‑second generations on FAL
Seedream 4 lands in ComfyUI and Replicate with 4K, multi‑image edit and up to nine outputs
Stable Audio 2.5 arrives on Replicate and ComfyUI with fast 3‑minute generation and inpainting
Ideogram V3 style presets go live on FAL and Replicate for one‑click aesthetics
Tencent tees up HunyuanImage 2.1 ComfyUI integration and lighter VRAM quantized release
Leonardo’s Lucid Origin hits Replicate, emphasizing prompt adherence and clean text
🏗️ Cloud, Capacity and Economics
OpenAI inks ~$300B Oracle compute pact; Oracle RPO swells, stock spikes ~43%
Perplexity raises $200M at ~$20B valuation; ARR near $200M
Broadcom lands ~$10B in AI ASIC orders; OpenAI mass production by 2026
NVIDIA unveils giga‑scale AI factory reference design for GW‑class sites
Claude Pro promo: 50% off for 3 months for new signups spotted
🚀 New and Upcoming Models
K2‑Think 32B hits ~2,000 tokens/sec per request
nanoVLM‑230M outperforms peer 250M models with tiny memory
Transformers PR lands Qwen3‑VL 4B and 30B‑A3B (Instruct/Thinking)
Hugging Face adds Qwen3‑Next support ahead of model drop
Duck.ai opens GPT‑OSS 120B to everyone, no login needed
EmbeddingGemma is trending; upgrade transformers to get correct results
🧩 MCP and Interop
ChatGPT adds full MCP client with write‑actions via Developer Mode
Open‑source ROS MCP Server bridges LLMs to real robots over ROS1/2
Qodo Aware MCP delivers deep codebase context to agents, with free endpoint
🛠️ Agent Engineering and Coding Tools
Replit Agent 3 pushes long‑running autonomy to 200 minutes
Qodo Aware MCP delivers deep context retrieval for large codebases
Claude gains web fetch tool for in‑API browsing workflows
Engineers flag Claude Code friction: approvals, reminders, context use
Plan‑first agent workflows spread: plan.md and an ‘oracle’ to lay rails
Anthropic ships Claude Code Analytics API for org‑wide metrics
Factory CLI lands VS Code/Cursor extension with one‑click model picks