NVIDIA Rubin NVL72 hits 10× tokens per MW – 75% fewer GPUs
Executive Summary
NVIDIA turned its earlier Rubin teaser into hard economics at CES: Vera Rubin NVL72 is now framed as a rack‑scale “tokens‑per‑watt” machine, with a 10T MoE training in a month on roughly 75% fewer GPUs than Blackwell and a 1 MW cluster serving about 10× more tokens per second at steady state. Each NVL72 rack delivers 3.6 EFLOPs NVFP4 inference and 2.5 EFLOPs training with 54 TB LPDDR5X plus 20.7 TB HBM4 (1.6 PB/s), while BlueField‑4‑backed KV‑cache sharing reportedly yields up to 5× better long‑context throughput. Jensen Huang also confirmed 45°C warm‑water DLC with no chillers; Bloomberg tracked sharp same‑day drops in Johnson Controls (‑11%) and Modine (‑21%) as markets priced in thinner AI chiller demand. Runway’s Gen‑4.5 video model moved from Hopper to Rubin NVL72 in a single day, signaling CUDA‑compatible tooling is ready.
• Model evals and cost optics: Artificial Analysis’ Intelligence Index v4.0 now ranks GPT‑5.2 ahead of Opus 4.5 and Gemini 3 Pro, while a cost chart puts a full v4.0 run at $2,930 for GPT‑5.2 vs $1,590 for Opus and ~$600–$1,000 for leading open weights.
• Voice agents and capital: Nemotron Speech + Nemotron 3 Nano + Magpie demonstrate fully open sub‑500 ms voice‑to‑voice stacks; Modal sustains 127 concurrent streams on one H100. xAI closed a $20B Series E with NVIDIA and Cisco, funding Grok 5 on Colossus clusters already exceeding 1M H100‑equivalent GPUs.
Together these moves tighten the link between rack‑level efficiency, eval‑measured capability, and capital intensity at the top of the AI stack, while open speech and agent tooling show how quickly those gains are being productized in real‑time interfaces.
Top links today
- LTX-2 open video and audio model
- HY-World 1.5 world model GitHub
- vLLM-Omni v0.12.0rc1 release notes
- Nemotron Speech ASR benchmarks on Modal
- AI agent systems architectures survey paper
- Recursive Language Models arxiv paper
- LAMER exploration in language agents paper
- SWEnergy study on coding agent efficiency
- Universal Weight Subspace Hypothesis paper
- Geometry of Reason mathematical reasoning paper
- Falcon-H1R hybrid reasoning model paper
- K-EXAONE 236B MoE technical report
- Artificial Analysis Intelligence Index v4.0 results
- CogFlow visual math problem solving paper
- NVIDIA voice agent and model stack writeup
Feature Spotlight
Feature: NVIDIA Rubin resets inference economics
Rubin NVL72 promises ~10× cheaper tokens and warm‑water (45°C) DLC; partners port in a day; cooling vendors’ stocks drop—datacenter and AI cost curves shift now.
Strong, cross‑account CES coverage adds new specifics: 10× lower token cost and warm‑water DLC at 45°C, full‑stack NVL72 specs, and day‑one partner ports. This continues yesterday’s Rubin storyline with concrete cooling and adoption impacts.
🧊 Feature: NVIDIA Rubin resets inference economics
Rubin NVL72 details sharpen 10× tokens-per-MW and 75% fewer GPUs claims
Rubin NVL72 (NVIDIA): NVIDIA expanded on Rubin NVL72’s economics at CES, saying a 10T MoE can be trained in one month with 75% fewer GPUs than Blackwell and that a 1 MW cluster can now serve about 10× more tokens per second, sharpening the initial Rubin coverage that outlined the first throughput and cost claims. The company also quantified rack-level compute at 3.6 EFLOPs NVFP4 inference and 2.5 EFLOPs NVFP4 training, with 54 TB LPDDR5X plus 20.7 TB HBM4 delivering 1.6 PB/s of memory bandwidth, and 260 TB/s of scale-up bandwidth across the NVL72 system, as described in the nvl72 spec tweet and the rubin blog recap.
• Training and inference economics: NVIDIA frames Rubin as a tokens-per-watt machine rather than a raw FLOPS play, emphasizing that a 10T MoE which previously needed a much larger Blackwell fleet can now train on a quarter of the GPUs while serving 10× more tokens at the same 1 MW power envelope, according to the rubin blog recap and a separate rubin summary.
• Six-chip rack-scale stack: The Vera Rubin NVL72 integrates a Vera CPU (88 Olympus cores, 1.2 TB/s of DRAM bandwidth, up to 1.5 TB LPDDR5X), Rubin GPUs, NVLink 6 switches, ConnectX‑9 SuperNICs, BlueField‑4 DPUs, and Spectrum‑6 Ethernet so that compute, networking, and control behave as a single logical system, with NVIDIA contrasting this tightly owned stack against more partner-driven rack designs from AMD’s Helios and Huawei in the rubin platform thread.
• KV‑cache reuse and long context: Rubin’s Inference Context Memory Storage Platform, powered by BlueField‑4, adds a shared AI-native key–value cache so that attention state can be reused across requests, which NVIDIA says can increase tokens-per-second throughput and power efficiency by up to 5× on long-context workloads, as detailed in the rubin blog recap and the nvidia blog.
The picture is that Rubin is being positioned not as a single GPU jump but as a rack-scale, cache-aware inference factory where both hardware and system software are tuned around tokens-per-watt rather than peak TFLOPS.
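The mechanics behind the KV‑cache claim are easiest to see in miniature. Below is a minimal sketch of prefix‑keyed cache reuse, an illustrative assumption rather than NVIDIA’s Inference Context Memory implementation: attention state for a shared prompt prefix is computed once, stored under a hash, and reattached to later requests so only each request’s suffix needs prefill.

```python
import hashlib

class PrefixKVCache:
    """Toy prefix-keyed KV store. Real systems (e.g. the BlueField-4-backed tier
    described above) shard and page this state, but the reuse logic is the same idea."""

    def __init__(self):
        self._store = {}  # prefix hash -> opaque KV blob

    @staticmethod
    def _key(prefix_tokens):
        return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()

    def get(self, prefix_tokens):
        return self._store.get(self._key(prefix_tokens))

    def put(self, prefix_tokens, kv_blob):
        self._store[self._key(prefix_tokens)] = kv_blob


def serve(request_tokens, shared_prefix, cache, engine):
    """Reuse cached attention state for a shared prefix; prefill only the new suffix.
    `engine.prefill` / `engine.decode` are hypothetical stand-ins for an inference
    engine's APIs, not a real library interface."""
    kv = cache.get(shared_prefix)
    if kv is None:
        kv = engine.prefill(shared_prefix)            # expensive, done once per prefix
        cache.put(shared_prefix, kv)
    suffix = request_tokens[len(shared_prefix):]
    return engine.decode(suffix, past_key_values=kv)  # per-request work stays small
```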
45°C warm‑water cooling for Rubin hits chiller vendors’ stock prices
Warm‑water DLC (NVIDIA Rubin): Jensen Huang highlighted that Vera Rubin NVL72 can run on 45°C warm‑water direct liquid cooling, saying that "no water chillers are necessary" for these racks and that data centers can cool Rubin-class supercomputers with hot-water loops and dry coolers instead of traditional chiller plants, according to the jensen cooling clip and the nvidia blog.

• Market reaction in cooling stocks: Bloomberg charted a sharp selloff in major HVAC and cooling vendors right after these CES remarks—Johnson Controls dropped as much as 11%, Modine slid up to 21%, and Carrier and Trane also fell, as capital markets priced in lower long‑term demand for large chiller plants in AI data centers, as shown in the cooling stocks writeup.
• Data‑center design implications: NVIDIA’s blog explains that Rubin’s warm‑water DLC design nearly doubles liquid flow at the same CDU pressure head, cutting fan and chiller energy and enabling dry‑cooler operation with minimal water use, which in turn frees more of a site’s power budget for GPUs rather than facility overhead, reinforcing the tokens-per-watt story from initial Rubin.
The combination of 45°C supply temperatures and dry‑cooler readiness signals a shift in AI facility economics where thermal engineering and power smoothing become as central to capacity planning as the GPUs themselves.
Runway ports Gen‑4.5 video model to Vera Rubin NVL72 in one day
Gen‑4.5 on Rubin (Runway + NVIDIA): Runway and NVIDIA reported that Runway’s Gen‑4.5 video model, described as "the world’s top‑rated video generation model", was ported from Hopper to a Vera Rubin NVL72 rack in a single day, making Runway one of the first partners to demonstrate a production video model on Rubin hardware, according to the runway partnership announcement.
• Signal on tooling readiness: The companies frame the one‑day port as evidence that existing Hopper‑targeted training and inference stacks can be brought up quickly on Rubin’s six‑chip platform without wholesale rewrites, which supports NVIDIA’s claim that Rubin is an evolution of its existing CUDA/NVLink ecosystem rather than a disruptive fork in the stack, as implied in the rubin platform thread.
• Early inference workloads: The partnership positions Vera Rubin NVL72 as a target for heavy multimodal inference—high‑resolution, long‑duration text‑to‑video—rather than only for frontier LLM training, dovetailing with Jensen Huang’s comment that a large share of future AI compute will go to video consumption and generation in the elon video focus clip.
This early Gen‑4.5 deployment indicates that Rubin’s promised 10× tokens‑per‑MW economics may extend beyond text models to high‑bandwidth video workloads once more partners follow Runway’s lead.
🛠️ Agent‑native coding: workflows, UIs and local runs
Busy day for dev tooling: Cursor’s dynamic context (filesystem offloading) lands, Claude Code gains a desktop/local entry point, Cline adds Background Edits, and paywalled orchestration (OpenCode Black) sells out. Excludes Rubin, which is the feature.
Claude Desktop adds local Claude Code so you can run agents without a terminal
Claude Code in Desktop (Anthropic): Anthropic quietly enabled a local Claude Code workflow inside the Claude Desktop app, so users can run coding agents on a selected folder without touching a terminal; setup is a four‑step flow—install Desktop, toggle the Code sidebar, grant folder access, then start prompting—spelled out in the desktop instructions.
• Terminal‑free entry point: This update gives less terminal‑comfortable users the same project‑aware coding experience previously limited to the CLI, as multiple developers point out in the retweet summary and note that you can now combine local files with Claude’s tools from a single UI.
• Local‑first usage: One engineer highlights that you can already kick off Claude Code from mobile by wiring it to GitHub, then use Desktop for local work on the same repos, which shifts Claude Code closer to being a general "agent on your filesystem" rather than a pure coding REPL in the mobile workflow note.
• Positioning vs ChatGPT: Some practitioners now compare this stack directly to ChatGPT’s Atlas and similar offerings, arguing that Claude Code plus Desktop and browser extensions better fits agent‑native workflows for real projects, as reflected in usage comments in the non‑technical angle.
Cursor’s dynamic context turns everything into files and cuts tokens by ~47%
Dynamic context (Cursor): Cursor’s new "dynamic context" system shifts from prompt‑stuffing to a filesystem‑backed approach where agents write long tool outputs, MCP responses, chat history and skills into files, then search and load only what’s needed at each step rather than keeping everything in the context window; Cursor reports a 46.9% reduction in total tokens when using multiple MCP servers while maintaining answer quality, per the feature announcement, with the design explained in the cursor blog.
• Filesystem offloading pattern: Long JSON tool responses and prior chats become files that agents can reopen or search, so destructive summarization is treated as "offloading" rather than permanent loss, as described in the design writeup and elaborated by Cursor’s cofounder in the context engineering thread.
• MCP and tools as files: Instead of static tool lists in the system prompt, MCP servers and their tools are mirrored into a folder structure, letting agents discover capabilities by directory traversal and filename search, which several engineers note lines up well with how models are trained to reason about files in the follow‑up notes and MCP discussion.
• Implication for agent‑native IDEs: Community commentary frames this as part of a broader shift toward "everything is a file" agent architectures that favor search over giant prompts, with some debate about pinning or always‑in‑context MCPs versus pure dynamic loading in the offloading takeaways and tool loading debate.
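As a rough mental model of the offloading pattern (a sketch under assumed thresholds and paths, not Cursor’s implementation), the snippet below writes an oversized tool response to a scratch file, returns only a short pointer for the context window, and exposes a cheap search over the offloaded files.

```python
import json
from pathlib import Path

SCRATCH = Path(".agent_context")   # hypothetical offload directory
SCRATCH.mkdir(exist_ok=True)
MAX_INLINE_CHARS = 2_000           # assumed threshold for "too big to keep in context"

def offload_tool_output(tool_name: str, payload: dict) -> str:
    """Return either the inline payload or a file pointer the agent can search later."""
    text = json.dumps(payload, indent=2)
    if len(text) <= MAX_INLINE_CHARS:
        return text
    path = SCRATCH / f"{tool_name}.json"
    path.write_text(text)
    # Only a short pointer goes back into the model's context.
    return f"[offloaded to {path}: {len(text)} chars; open or search this file if needed]"

def search_offloaded(query: str) -> list[str]:
    """Cheap grep over offloaded files, standing in for the agent's search tool."""
    hits = []
    for path in SCRATCH.glob("*.json"):
        for line in path.read_text().splitlines():
            if query.lower() in line.lower():
                hits.append(f"{path.name}: {line.strip()}")
    return hits
```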
Cline 3.47.0 adds Background Edits and spotlights MiniMax M2.1 coding model
Cline 3.47.0 (Cline): The Cline team shipped v3.47.0 with a new Background Edits mode that lets the agent modify files in the background while the user keeps typing and navigating in their editor, plus a temporary free‑use window for MiniMax M2.1 as a coding backend, according to the release thread and details in the cline blog.

• Flow‑preserving edits: Background Edits addresses a frequent complaint that Cline used to yank focus into diff views and steal the cursor whenever it edited files; now it applies changes quietly until the user is ready to review and approve, which the team positions as key to staying "in flow" in the feature demo.
• MiniMax M2.1 promotion: For a limited time, MiniMax M2.1 is available free through the Cline provider, with developers calling out its strong multilingual coding and suitability for multi‑file refactors when paired with Cline’s orchestration in the model recommendation and echoed by MiniMax’s own workshop announcement in the minimax event.
• Ecosystem angle: The release also includes smaller fixes like Azure identity auth and better binary‑file handling, but the emphasis from both Cline and MiniMax is on treating Cline as a harness where you can swap in frontier and open‑weight models for long‑running coding agents rather than a single‑model assistant, as described in the release thread.
OpenCode Black $200 "any model" coding tier sells out in an hour
OpenCode Black (OpenCode): OpenCode launched a limited‑run OpenCode Black subscription at $200/month that promises "use any model" access for its coding agent server, and the initial batch sold out in about an hour, with the team saying they will use this cohort to learn usage patterns before offering cheaper, wider plans in the pricing tease and sellout update.

• Positioning vs Claude Max/Codex Pro: The creator frames Black as aiming for parity with offerings like Claude Max and Codex Pro—explicitly calling out that goal and noting this is an experiment to understand real‑world consumption before locking in limits and SKUs in the plan description.
• "Any model" orchestration: Alongside the subscription, OpenCode emphasizes that its existing server is already a convenient way to embed a coding agent anywhere and will evolve into a system that can orchestrate many agents across different providers, suggesting Black users will be early adopters of that multi‑agent, multi‑model harness in the server roadmap and follow‑up link.
• Community traction: The core OpenCode repo recently celebrated 50k GitHub stars and 500 contributors, which some users connect directly to appetite for premium tiers like Black in the stars milestone.
Rork leans on Claude Code and Opus 4.5 to ship mobile apps in three clicks
Rork 1.5 (Rork): Rork is pitching itself as "the easiest way to use Claude Code with Opus 4.5 to build real mobile apps," claiming it can take an agent‑authored app from spec to App Store publish in three clicks and offering a $25 subscription credit to early signups until tomorrow, as described in the rork offer.
• Agent‑native mobile stack: The product wraps Claude Code plus Opus 4.5 into an agent‑native flow that generates code, packages it as a mobile app and handles store submission, effectively turning Claude into a mobile app engineer rather than a generic assistant, a vision reinforced by the workshop and dinner announcement for M2.1 builders in San Francisco in the workshop invite.
• Ecosystem and community: Rork’s team highlights a small ecosystem of successful app founders and community members around the tool, presenting it as one of the first concrete examples where an agent harness is tied directly to monetizable mobile output rather than toy projects, which is a theme running through the rork offer.
🔌 Skills, MCP and agent‑to‑agent plumbing
Focus on interoperable agent stacks: Claude “Agent Skills” guidance, open‑source MCP mailboxes for A2A comms, centralized skill config via dotagents, and an active debate on filesystem vs MCP discovery. Excludes feature hardware topic.
Anthropic Agent Skills framing spreads as builders pivot from GPTs to reusable skills
Agent Skills (Anthropic): Anthropic’s "skills, not agents" framing is getting traction across the ecosystem, with community posts arguing that GPT-style standalone agents are a dead end and that centralized, reusable Skills are a better abstraction for complex apps, as laid out in the Skills diagram and talk in the skills explainer and in follow-on commentary in the agent skills video. Only lightweight Skill metadata is kept in Claude Code’s context by default, and the full implementation is pulled in on demand when the agent decides to call that Skill, which cuts unnecessary context bloat according to the skills metadata note.
• From GPTs to skills: Commentators describe GPT-style mini-apps as hard to manage and compose at scale and suggest turning GPT prompts into Skills so agents can invoke them directly when needed rather than users manually choosing and chaining GPTs, as discussed in the gpts vs skills and skills advantages threads.
• Access for non‑coders: Builders note that PMs and other non‑engineers can define Skills by writing natural-language specs and wiring existing tools, with one product manager explicitly told to "at least get Claude Pro and get started" since "you don’t have to know how to code to make Skills" in the skills for pm exchange.
• Role in agent‑native apps: Dan Shipper positions Skills as a core building block for "agent‑native" architecture—where every user action can also be done by an agent—and calls out that Skills sit in the filesystem as durable building blocks while agents orchestrate them in new ways in the agent native apps post.
The point is: Skills are starting to look like the portability and governance layer for agent ecosystems, while GPT-style mini‑apps are being reframed as raw material to be refactored into those Skills.
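The metadata-first loading behavior is simple to sketch. The snippet below assumes a hypothetical layout, one folder per skill containing a SKILL.md whose opening lines hold the name and description, and mirrors the pattern of keeping only those headers in context until a skill is actually invoked.

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # hypothetical layout: skills/<skill-name>/SKILL.md

def list_skill_metadata() -> list[dict]:
    """Collect only the lightweight headers, which are cheap to keep in context."""
    catalog = []
    for skill_md in sorted(SKILLS_DIR.glob("*/SKILL.md")):
        header_lines = skill_md.read_text().splitlines()[:5]  # assume name/description up top
        catalog.append({"skill": skill_md.parent.name, "header": header_lines})
    return catalog

def load_skill(skill_name: str) -> str:
    """Pull the full implementation into context only when the agent invokes the skill."""
    return (SKILLS_DIR / skill_name / "SKILL.md").read_text()
```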
MCP Agent Mail emerges as popular open mailbox for agent‑to‑agent chat
MCP Agent Mail (community): MCP Agent Mail is being positioned as a de facto open mailbox layer for AI agents, with the author reporting four viral launches, "many thousands of happy users," and broad compatibility across different agent harnesses, as detailed in the agent mail thread. The tool auto‑detects and configures MCP servers in under a minute, provides a web UI to inspect what agents are saying to each other, and includes export and share features so teams can publish mailboxes as live artifacts, as shown in the shared mailbox demo.
• Agnostic A2A plumbing: The project is described as working with "all agents" rather than a single vendor harness, acting as a shared inbox where tool calls, intermediate messages, and final outputs are stored as MCP messages instead of staying locked inside one app silo, according to the agent mail thread.
• Operational visibility and sharing: The UI surfaces threads and metadata so humans can audit or debug multi‑agent workflows, while the export function turns those conversations into shareable URLs that others can browse without reproducing the setup, which the author emphasizes in both the agent mail thread and the linked github repo.
This places MCP Agent Mail squarely in the emerging pattern of agent‑to‑agent plumbing, filling the gap between raw MCP protocols and higher‑level orchestration frameworks.
Cursor team defends filesystem‑based MCP discovery as users call for pinning
MCP discovery (Cursor): A public back‑and‑forth between power users and Cursor’s founder surfaced trade‑offs in how MCP tools are exposed to models: Cursor deliberately avoids an Anthropic‑style "tool search" API and instead syncs each MCP server into a dedicated filesystem folder so related tools live together and can be discovered via file listing, as described in the cursor mcp design reply. Tool descriptions and details are then pulled into context progressively when the agent decides to inspect or use a tool, with the team noting that this design is biased toward longer, more ambitious tasks rather than quick one‑shot utilities in the progressive loading note.
• User concerns about cohesive servers: Users argue that this pattern works best when MCP servers act like bags of disjointed tools, but creates friction when a server is a cohesive unit (for example, a tightly integrated API suite), because then you need separate Skills or markdown docs to explain server‑level behavior to the model; otherwise quality can regress, as warned in the mcp cohesion critique.
• Calls for pinned servers and prompts: Suggestions include letting users "pin" certain MCP servers so their full tool manifests always sit in context, and leveraging MCP prompt specs so slash‑command style triggers can tell the harness when to hoist an entire server into context instead of letting the model rediscover it via search, as proposed in the pinning suggestion and prompt spec idea exchanges.
The discussion highlights an emerging design fork in MCP plumbing: whether to prioritize context efficiency via filesystem offloading or prioritize semantic grouping and always‑on access for rich, cohesive servers.
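To make the design fork concrete, here is a toy sketch (not Cursor’s code) of the two policies in tension: MCP servers mirrored into folders that the agent discovers lazily by listing and reading files, plus an optional pinned set whose full tool manifests are always placed in context.

```python
from pathlib import Path

MCP_ROOT = Path("mcp")      # hypothetical mirror: mcp/<server>/<tool>.md, one file per tool
PINNED = {"payments-api"}   # servers whose full manifests should always sit in context

def build_context() -> str:
    """Pinned servers get their full manifests; the rest get a bare tool listing."""
    sections = []
    for server_dir in sorted(p for p in MCP_ROOT.iterdir() if p.is_dir()):
        tool_files = sorted(server_dir.glob("*.md"))
        if server_dir.name in PINNED:
            body = "\n\n".join(p.read_text() for p in tool_files)
            sections.append(f"## {server_dir.name} (pinned)\n{body}")
        else:
            names = ", ".join(p.stem for p in tool_files)
            sections.append(f"## {server_dir.name}\ntools: {names} (read a file for details)")
    return "\n\n".join(sections)

def inspect_tool(server: str, tool: str) -> str:
    """Progressive loading: a single tool's description is read only when the agent asks."""
    return (MCP_ROOT / server / f"{tool}.md").read_text()
```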
dotagents TUI consolidates Claude Code skills, hooks and commands into .agents
dotagents (community): Ian Nuttall released dotagents, a text‑UI tool that migrates scattered agent configuration—AGENTS/CLAUDE.md files, hooks, commands and skills—from multiple tools like Claude Code, Codex and Factory into a single .agents directory, then symlinks them back so there is one source of truth for both global and project‑local setups, as explained in the dotagents description. The repository, published under MIT license, supports operating either at a per‑project level or as a global agent config hub on a machine, according to the github repo.
This addresses a practical paper‑cut many early agent users hit: keeping Skills, tool manifests and agent directives in sync across several harnesses once they start wiring Claude Code, Codex, and related CLIs into the same workflows.
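Conceptually the migration is "move the real file into one hub, leave a symlink where each tool expects it." A minimal sketch under that assumption, with hypothetical paths; this is not the dotagents implementation itself.

```python
import shutil
from pathlib import Path

AGENTS_HUB = Path.home() / ".agents"  # single source of truth for agent config

# Hypothetical examples of scattered files that different harnesses read:
SCATTERED = [
    Path.home() / ".claude" / "CLAUDE.md",
    Path.home() / ".codex" / "AGENTS.md",
]

def consolidate(paths=SCATTERED, hub=AGENTS_HUB) -> None:
    hub.mkdir(parents=True, exist_ok=True)
    for original in paths:
        if not original.exists() or original.is_symlink():
            continue                              # nothing to do, or already migrated
        target = hub / original.name
        shutil.move(str(original), str(target))   # the hub now owns the real file
        original.symlink_to(target)               # tools keep reading their usual path
```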
📊 Artificial Analysis Index v4.0 and eval cost optics
AA rolls out Index v4.0 (less saturated, more agentic) plus new GDPval‑AA, AA‑Omniscience and CritPT. Cost‑to‑run charts compare models’ dollar burn. This is new vs yesterday’s coverage by adding the full v4.0 breakdown and costs.
GPT‑5.2 tops Artificial Analysis Intelligence Index v4.0
Intelligence Index v4.0 (Artificial Analysis): Artificial Analysis released version 4.0 of its Intelligence Index with ten evaluations spread across four equally weighted pillars—Agents, Coding, Scientific Reasoning, General—and reports far less score saturation, with top models now ≤50 vs 73 in v3.0, according to the launch thread index v4 overview. GPT‑5.2 with xhigh reasoning effort leads the new index, followed by Anthropic’s Claude Opus 4.5 and Google’s Gemini 3 Pro Preview (high), while the benchmark mix shifts by dropping MMLU‑Pro, AIME 2025 and LiveCodeBench and adding AA‑Omniscience, GDPval‑AA and CritPT to better reflect real‑world, agentic workloads index v4 overview.
AA‑Omniscience decouples LLM accuracy from hallucination behavior
AA‑Omniscience (Artificial Analysis): Artificial Analysis introduced AA‑Omniscience, a 6,000‑question benchmark spanning 42 economically relevant topics in six domains (Business, Health, Law, Software Engineering, Humanities & Social Sciences, and STEM) to jointly measure factual recall and hallucination, scoring models with an Omniscience Index that rewards precise knowledge and penalizes confident guessing omniscience description. Gemini 3 Pro Preview (high) currently tops the Omniscience Index at 13, with Gemini 3 Flash (Reasoning) and Claude Opus 4.5 (Thinking) tied behind, but the breakdown shows that high accuracy does not imply low hallucination: Gemini 3 Pro and Flash reach 54% and 51% knowledge accuracy yet post very high hallucination rates (88% and 85%), while Claude Sonnet 4.5 (Thinking), Claude Opus 4.5 (Thinking) and GPT‑5.1 (high) trade higher abstention for lower hallucination. The Omniscience metrics each contribute 6.25% to the overall Intelligence Index v4.0 weighting omniscience description.
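One plausible reading of how such an index and hallucination rate interact (an illustrative assumption, not Artificial Analysis’s published formula, though it reproduces the reported figures reasonably well): correct answers add to the score, wrong answers subtract, abstentions are neutral, and hallucination is measured only over questions the model did not answer correctly.

```python
def omniscience_style_scores(results: list[str]) -> tuple[float, float]:
    """results entries: 'correct', 'incorrect', or 'abstain'.
    The index rewards knowledge and penalizes confident wrong answers (abstaining is
    neutral); the hallucination rate here is wrong answers as a share of the questions
    the model did not get right."""
    correct = results.count("correct")
    incorrect = results.count("incorrect")
    abstain = results.count("abstain")
    index = 100 * (correct - incorrect) / len(results)
    hallucination_rate = incorrect / max(incorrect + abstain, 1)
    return index, hallucination_rate

# Toy check against the figures above: 54 correct, 40 wrong, 6 abstained out of 100
# questions gives an index of 14 and a hallucination rate of ~87%, close to the
# reported 13 / 88% profile for Gemini 3 Pro.
results = ["correct"] * 54 + ["incorrect"] * 40 + ["abstain"] * 6
print(omniscience_style_scores(results))  # (14.0, 0.869...)
```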
CritPT physics benchmark exposes limits of frontier reasoning
CritPT (Artificial Analysis): The CritPT evaluation, built by more than 50 active physics researchers across 30+ institutions, tests models on 71 composite, research‑level challenges in areas like condensed matter, quantum physics, astrophysics and high‑energy physics, each designed to resemble an entry‑level graduate research warm‑up with machine‑verifiable answers critpt paper. GPT‑5.2 (xhigh) leads the CritPT leaderboard with roughly 11.5% accuracy, with Gemini 3 Pro Preview (high) and Claude Opus 4.5 (Thinking) following, but Artificial Analysis stresses in its v4.0 thread that even the best current models remain far from reliably solving full research‑scale physics problems end‑to‑end critpt recap.
GDPval‑AA ranks GPT‑5.2 and Opus 4.5 on real economic tasks
GDPval‑AA (Artificial Analysis): The new GDPval‑AA benchmark evaluates generalist agent performance on OpenAI’s GDPval dataset of economically valuable tasks across 44 occupations and 9 industries, running models in a Stirrup reference harness with shell and browser access and deriving ELO scores from blind pairwise comparisons index v4 overview. GPT‑5.2 (xhigh) leads the GDPval‑AA leaderboard with an ELO of 1442, followed by Claude Opus 4.5 (with the non‑thinking variant scoring highest at 1403), other GPT‑5.2 and Opus 4.5 reasoning settings, and Claude Sonnet 4.5 at 1259, on tasks that output realistic work products like documents, slides, spreadsheets and multimedia deliverables gdpval details.
Artificial Analysis publishes dollar cost to run its full Index v4.0
Eval cost optics (Artificial Analysis): A cost breakdown chart from Artificial Analysis quantifies how much it costs to run all Intelligence Index v4.0 evaluations per model, putting GPT‑5.2 (xhigh) at about $2,930 total (roughly $504 in input and $2,361 in output tokens), ahead of Grok 4 at $1,852 and Claude Opus 4.5 at $1,590, where Opus’ spend skews toward $787 input and $621 reasoning cost cost chart post. Gemini 3 Pro Preview (high) and GPT‑5.1 (high) land in a lower‑cost middle tier at $988 and $927 respectively, while Claude Sonnet 4.5 ($810), Qwen3 235B A22B (~$622) and GLM‑4.7 (~$611) illustrate how large open‑weight models can complete the same evaluation battery at substantially lower dollar burn, with the stacked bars separating input, output and explicit reasoning surcharges cost chart post.
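The bars in that chart are token volumes multiplied by per-token prices. A generic sketch of the arithmetic, with entirely hypothetical prices and token counts, since the chart itself only reports the resulting dollar splits:

```python
def eval_run_cost(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float,
                  reasoning_tokens: int = 0,
                  usd_per_m_reasoning: float | None = None) -> dict[str, float]:
    """Split an eval run's spend into the input/output/reasoning buckets the chart stacks."""
    if usd_per_m_reasoning is None:
        usd_per_m_reasoning = usd_per_m_output  # reasoning tokens are often billed as output
    cost = {
        "input": input_tokens / 1e6 * usd_per_m_input,
        "output": output_tokens / 1e6 * usd_per_m_output,
        "reasoning": reasoning_tokens / 1e6 * usd_per_m_reasoning,
    }
    cost["total"] = sum(cost.values())
    return cost

# Hypothetical example: 50M input tokens at $10/M plus 20M output tokens at $30/M
# comes to roughly $500 + $600 = $1,100 for a full benchmark battery.
print(eval_run_cost(50_000_000, 20_000_000, usd_per_m_input=10.0, usd_per_m_output=30.0))
```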
🧠 Open/edge models and coding specialists
Multiple model drops for builders: an edge‑first LFM2.5 family, a 14B RL‑hardened coder with full training stack, a 236B MoE tech report, and a new image contender sighting. This avoids video/audio LTX‑2 (covered under Media).
Liquid AI releases LFM2.5 open-weight edge model family
LFM2.5 (Liquid AI): Liquid AI has announced LFM2.5, an open‑weight model family designed to run fast, private, and always‑on directly on devices across text, vision, audio, and Japanese‑language use cases, with the base 1.2B model’s training scaled up from 10T to 28T tokens for stronger reasoning and knowledge recall, as described in the lfm2-5 blog and echoed in the lfm2-5 overview.
• On‑device focus: The models target CPU‑class hardware with low memory footprints and optimized inference so that both generation and understanding tasks can stay local, avoiding cloud latency and data export—this edge emphasis is highlighted in the lfm2-5 blog.
• Multimodal & multilingual: The family spans text, vision, and audio (including an 8× larger audio variant) and is tuned for Japanese and other languages, with early community attention drawn to high‑quality text‑to‑speech samples in the lfm2-5 overview.
• Open stack: Liquid positions LFM2.5 as fully open‑weight infrastructure for builders, enabling fine‑tuning and integration into custom agents and local apps without proprietary dependencies, as laid out in the technical details of the lfm2-5 blog.
This release extends the trend of compact, open models optimized for edge deployment, giving engineers another viable option when GPU access is constrained or data must remain on device.
LG AI details K‑EXAONE: 236B MoE with 23B active and 256K context
K‑EXAONE (LG AI Research): LG AI Research released the technical report for K‑EXAONE, a multilingual Mixture‑of‑Experts language model with 236B total parameters but only 23B active per token, supporting a 256K‑token context window and six languages, and posting performance comparable to similarly sized open‑weight peers according to the k-exaone paper.
• Architecture & efficiency: K‑EXAONE uses an MoE design to cut active compute while keeping capacity high, pairing hybrid attention with Multi‑Token Prediction to yield around 1.5× decoding speed‑ups relative to dense baselines of similar quality, as summarized in the k-exaone paper.
• Language coverage: The model is trained for Korean, English, Spanish, German, Japanese, and Vietnamese, aiming squarely at both domestic Korean usage and broader multilingual applications, with this six‑language focus highlighted in the k-exaone paper.
• Positioning vs peers: Benchmarks reported by LG put K‑EXAONE in the same band as leading open‑weight large models on reasoning and general‑purpose tasks, while operating at a lower active parameter count via MoE routing, again outlined in the k-exaone paper.
This positions K‑EXAONE as one of the more ambitious open descriptions of a frontier‑scale MoE system, giving practitioners another reference point for large multilingual, long‑context deployments.
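The 236B-total versus 23B-active split comes from sparse expert routing: each token activates only a few experts, so per-token compute tracks the selected experts rather than the full parameter count. A minimal top-k routing sketch in PyTorch, offered as an illustration of the general technique rather than LG’s actual design:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer: each token is routed to top_k of n_experts
    feed-forward blocks, so active parameters per token are roughly
    top_k / n_experts of the layer's total."""
    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        gate = self.router(x).softmax(dim=-1)             # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)      # pick top_k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():                            # only the selected experts run
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```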
NousCoder‑14B hits 67.87% Pass@1 with fully open RL stack
NousCoder‑14B (Nous Research): Nous Research introduced NousCoder‑14B, a coding‑specialist model post‑trained on Qwen3‑14B that reaches 67.87% Pass@1—a +7.08 point gain over the Qwen baseline—on their target coding benchmark using verifiable execution rewards, as detailed in the nouscoder announcement and expanded in the nouscoder blog.
• Training recipe: The team used 48 B200 GPUs over 4 days, running reinforcement learning in their Atropos framework with an open RL environment, benchmark, and harness so others can reproduce or extend the experiments according to the nouscoder announcement.
• Reward design: Rewards are based on verifiable code execution rather than proxy signals, which reduces reward hacking and improves alignment between metric gains and real bug‑fixing or problem‑solving ability as explained in the nouscoder blog.
• Open infrastructure: Atropos, the RL stack, and logs are all released, giving practitioners a concrete reference for building RL‑hardened coding models on top of existing bases like Qwen3‑14B, as emphasized in the nouscoder announcement.
For engineers evaluating specialized coders, NousCoder‑14B shows how much headroom remains when you combine a strong base model with domain‑specific RL and a fully open training pipeline.
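Verifiable execution rewards are easy to sketch: run the candidate code against real tests in a throwaway environment and score what actually passes. The snippet below is a minimal illustration (it assumes pytest is installed; a production RL harness such as Atropos adds sandboxing, resource limits and batching on top).

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def execution_reward(candidate_code: str, test_code: str, timeout_s: int = 30) -> float:
    """Reward = 1.0 if the candidate passes the provided tests, else 0.0.
    Grounding the signal in real execution makes it hard to reward-hack with
    plausible-looking but non-functional code."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "solution.py").write_text(candidate_code)
        (tmp / "test_solution.py").write_text(test_code)
        try:
            proc = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if proc.returncode == 0 else 0.0

# Example: the reward is 1.0 only if the generated add() really works.
candidate = "def add(a, b):\n    return a + b\n"
tests = "from solution import add\n\ndef test_add():\n    assert add(2, 3) == 5\n"
print(execution_reward(candidate, tests))
```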
Mystery "Goldfish" model impresses in Arena image comparisons
Goldfish image model (unspecified): A previously unannounced image model labeled “Goldfish” has appeared on Artificial Analysis’ Image Arena and is drawing attention for strong, photorealistic outputs in side‑by‑side comparisons against known models like FLUX.1 Kontext and GPT Image 1.5, as shown in the arena graffiti test and exercise infographic.
• Prompted comparisons: In urban street and infographic prompts, multiple users report preferring Goldfish’s samples over baselines; one example pits it against GPT Image 1.5 on an exercise‑benefits infographic, where Goldfish maintains layout and typography quality at least on par with the OpenAI model in the exercise infographic.
• Style and consistency: Earlier tests on graffiti‑covered alleyway scenes show Goldfish handling depth, lighting, and detailed textures well, matching or exceeding FLUX.1 Kontext [pro] on user preference in the arena graffiti test.
• Speculation & unknowns: Community posts note that Goldfish is not yet tied to a public card or repo, and some speculate it could be a new "Nano Banana Flash"‑style model from Google or another lab, but at this point both the model’s origin and licensing status remain undocumented in the goldfish arena mention and goldfish speculation.
For analysts tracking the image‑model landscape, Goldfish is an early signal that another high‑end, likely open‑weights contender may be in testing, but hard details on training data, safety tuning, and access are not yet available.
🗣️ Sub‑500ms voice stacks with NVIDIA Nemotron Speech
Today’s clips show production‑grade voice agents: open models (ASR+LLM+TTS) with <500ms voice‑to‑voice, and scaling tests hitting 127 concurrent clients on one H100. Mostly engineering notes and code links.
Nemotron Speech ASR stack hits <500ms voice‑to‑voice with fully open models
Nemotron Speech ASR stack (NVIDIA): NVIDIA engineers showcase an end‑to‑end voice agent that finalizes transcription in ~24ms and completes round‑trip voice‑to‑voice in under 500ms using Nemotron Speech ASR, Nemotron 3 Nano 30GB (4‑bit) as the LLM, and a preview Magpie TTS model in a single pipeline, with all three models and their training artifacts released as truly open source, according to the demo and commentary in the agent demo.

• Latency and model stack: The reference agent uses streaming Nemotron Speech ASR for low‑drift transcripts, Nemotron 3 Nano 30GB quantized to 4‑bit via vLLM for fast reasoning, and an early Magpie checkpoint for text‑to‑speech; the system achieves 24ms transcription finalization and sub‑500ms total voice response in the example configuration, as described in the agent demo.
• Open source and deployment: NVIDIA emphasizes that Nemotron Speech ASR, Nemotron 3 Nano, and Magpie are released with weights, training data, training code, and inference code, and publishes a full reference implementation that can be deployed either on Modal’s cloud or locally on DGX Spark and RTX 5090 via Docker, as outlined in the voice agent blog and summarized in the deployment notes.
• CES positioning in open model universe: Nemotron Speech ASR is introduced at CES 2026 as part of NVIDIA’s broader push for truly open models optimized for low‑latency use cases like real‑time voice assistants, with the ASR highlighted as a building block for production‑grade, multi‑model agents in the nvidia asr promo.
Overall this positions Nemotron Speech ASR plus the paired LLM and TTS as a reference design for builders who want sub‑second, end‑to‑end voice agents without relying on closed APIs or proprietary training pipelines.
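As a rough picture of where the milliseconds go, the sketch below chains three placeholder stages into one turn and measures the voice-to-voice budget end to end; the stage functions are hypothetical stand-ins, not NVIDIA’s actual APIs.

```python
import asyncio
import time

# Hypothetical stand-ins for the three stages; in the reference stack these would be
# streaming Nemotron Speech ASR, Nemotron 3 Nano served via vLLM, and Magpie TTS.
async def transcribe(audio_chunk: bytes) -> str:
    return "what's the weather like"            # placeholder transcript

async def generate_reply(text: str) -> str:
    return "It looks sunny this afternoon."     # placeholder LLM reply

async def synthesize(text: str) -> bytes:
    return b"\x00" * 16_000                     # placeholder audio frame

async def voice_turn(audio_chunk: bytes) -> tuple[bytes, float]:
    """One user utterance through ASR -> LLM -> TTS, returning audio and latency in ms."""
    t0 = time.perf_counter()
    transcript = await transcribe(audio_chunk)      # target: finalized in ~tens of ms
    reply_text = await generate_reply(transcript)   # a small quantized LLM keeps this short
    reply_audio = await synthesize(reply_text)
    return reply_audio, (time.perf_counter() - t0) * 1000  # aim under the ~500 ms budget

print(asyncio.run(voice_turn(b"\x00" * 3200))[1], "ms")
```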
Modal benchmarks Nemotron Speech ASR at 127 concurrent streams on one H100
Nemotron Speech ASR scaling (Modal): Modal reports that Nemotron Speech ASR can serve 127 simultaneous streaming WebSocket clients with sub‑second 90th‑percentile latency on a single NVIDIA H100 GPU, framing the model as suitable for large‑fan‑out real‑time voice agents rather than single‑user demos, according to their benchmark summary in the modal benchmark.
• Concurrency and latency profile: The shared chart shows p90 delays remaining below one second even as concurrent streams ramp from tens to 120+, with separate curves for different segmenting configurations (for example 160ms, 560ms, 1.12s windows), illustrating how the ASR maintains responsiveness under load in the modal benchmark.
• Infrastructure implications: Running 127 active voice streams on a single H100 suggests that modest clusters can back sizeable fleets of phone bots or assistants, and the accompanying write‑up details how this setup was implemented on Modal’s serverless GPU platform for reproducible load testing in the benchmark blog.
This benchmark gives AI engineers a concrete scaling data point for Nemotron Speech ASR, tying its single‑conversation latency claims to multi‑tenant, production‑style workloads on current‑generation NVIDIA accelerators.
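A sketch of that style of fan-out test, assuming a hypothetical streaming endpoint and using the third-party `websockets` library: open N concurrent clients, stream dummy audio frames, and report the 90th-percentile round-trip time.

```python
import asyncio
import statistics
import time

import websockets  # pip install websockets

URI = "wss://example.invalid/asr"   # hypothetical streaming ASR endpoint
N_CLIENTS, FRAMES_PER_CLIENT = 127, 50

async def one_client(latencies: list[float]) -> None:
    async with websockets.connect(URI) as ws:
        frame = b"\x00" * 3200                  # ~100 ms of 16 kHz, 16-bit mono audio
        for _ in range(FRAMES_PER_CLIENT):
            t0 = time.perf_counter()
            await ws.send(frame)
            await ws.recv()                     # partial or final transcript comes back
            latencies.append(time.perf_counter() - t0)

async def main() -> None:
    latencies: list[float] = []
    await asyncio.gather(*(one_client(latencies) for _ in range(N_CLIENTS)))
    p90 = statistics.quantiles(latencies, n=10)[-1]   # 90th-percentile cut point
    print(f"p90 round-trip: {p90 * 1000:.0f} ms across {len(latencies)} messages")

asyncio.run(main())
```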
🔎 Binary + int8 rescoring for 40M‑doc, 200ms search
Hands‑on retrieval stack shows CPU‑friendly search at scale: binary indexes + int8 rescoring hit ~200ms over 40M texts with tiny RAM. Multiple posts share RAM/disk math and restore‑accuracy curves.
Binary + int8 quantized retrieval hits 200ms search over 40M docs on CPU
Quantized Retrieval demo (Hugging Face / sentence-transformers): Tom Aarsen shows a retrieval stack that answers queries over 40M Wikipedia texts in ~200ms on a single CPU server with 8GB RAM and ~45GB of disk, using a binary index in memory plus int8 embeddings on disk as described in the search thread and implemented in the demo space. Instead of storing fp32 vectors, embeddings are precomputed once, then stored as a 32× smaller binary index for fast approximate search and a 4× smaller int8 view for precise rescoring, with only the binary index resident in RAM according to the quantization blog.
• Inference pipeline: The system embeds the query in fp32, quantizes it to binary, retrieves ~40 candidates via an exact or IVF binary index, then loads the corresponding int8 embeddings from disk to rescore and rank the top ~10 using the original fp32 query, as outlined in the pipeline description.
• Accuracy vs fp32: By retrieving roughly 4× as many candidates with the binary index and then int8‑rescoring, the setup recovers about 99% of full fp32 retrieval performance, compared to around 97% when using the binary index alone without rescoring, per the results shared in the accuracy note.
• Resource and speed delta: A conventional fp32 index over the same 40M texts would require on the order of 180GB RAM and 180GB of embedding storage and run roughly 20–25× slower, whereas the binary+int8 approach uses ~6GB RAM and ~45GB of disk for embeddings, as quantified in the ram and disk discussion.
• Extensibility: The author notes that this pattern can be extended with a sparse component such as BM25 or SPLADE-style encoders to form hybrid retrieval stacks while keeping most of the efficiency gains, as suggested in the sparse extension.
Overall this thread illustrates a concrete recipe for large‑scale, CPU‑friendly retrieval where careful quantization and two‑stage scoring trade a small loss in recall for dramatic savings in memory, disk, and latency.
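A compact numpy sketch of the two-stage scheme (illustrative only, not the demo’s exact code): pack sign bits for a Hamming-distance shortlist held in RAM, then rescore just that shortlist against int8 vectors using the original fp32 query. Sentence-transformers also ships embedding-quantization helpers for these conversions if you prefer not to hand-roll the bit packing.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((20_000, 1024)).astype(np.float32)  # stand-in for real embeddings
query = rng.standard_normal(1024).astype(np.float32)

# Offline: binary index (32x smaller, kept in RAM) and int8 view (4x smaller, kept on disk).
binary_index = np.packbits(corpus > 0, axis=1)
int8_view = np.clip(np.round(corpus * 127 / np.abs(corpus).max()), -127, 127).astype(np.int8)

def search(query: np.ndarray, shortlist: int = 40, top_k: int = 10) -> np.ndarray:
    q_bits = np.packbits(query > 0)
    # Stage 1: approximate shortlist via Hamming distance on the binary index.
    hamming = np.unpackbits(binary_index ^ q_bits, axis=1).sum(axis=1)
    candidates = np.argpartition(hamming, shortlist)[:shortlist]
    # Stage 2: rescore only the shortlist with int8 vectors and the fp32 query.
    scores = int8_view[candidates].astype(np.float32) @ query
    return candidates[np.argsort(-scores)][:top_k]

print(search(query))
```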
🎬 Open video/audio tooling and creator controls
A sizable media cluster: LTX‑2 ships open weights + training code (Day‑0 ComfyUI support) with 4K and lip‑sync, NVIDIA‑optimized checkpoints land, and image relight/mark‑to‑edit tools surface. This excludes Rubin (covered as feature).
LTX‑2 open audio‑video model gets NV‑optimized ComfyUI support
LTX‑2 (Lightricks): Open‑weight LTX‑2, a multimodal audio‑video diffusion model with native 4K output and synchronized sound, is now shipping with full weights, training code and distilled checkpoints that run locally on RTX GPUs, as described in the launch breakdown architecture thread and the Hugging Face model page model card; following up on LTX-2 launch which focused on speed, today’s updates show it moving into production‑grade tooling with Day‑0 ComfyUI graphs and NVIDIA‑tuned quantized variants.

• Day‑0 ComfyUI integration: ComfyUI added native LTX‑2 nodes with text‑to‑video, image‑to‑video and control inputs (canny/depth/pose), so creators can keyframe, set aspect ratios (16:9/9:16) and drive synchronized dialogue/SFX/music from a single pass, as shown in the ComfyUI announcement comfyui support.
• NVIDIA NVFP4/NVFP8 checkpoints: Lightricks, NVIDIA and ComfyUI report new NVFP4 and NVFP8 LTX‑2 checkpoints that run up to 3× faster with ~60% less VRAM while still generating up to 20‑second clips at 4K and 50 fps, according to the joint update comfyui nv update and the linked NVIDIA and ComfyUI write‑ups nvidia blog and comfyui guide.
• Local, extensible stack: Early testers describe running the distilled model in real time on single‑GPU desktops with no data leaving the machine architecture thread, and the published training framework plus LoRA hooks mean teams can fine‑tune on their own styles or motions without retraining the full 19B backbone ltx2 repo.
The net result is that LTX‑2 has moved from an open research drop to a practical, controllable video engine that fits into existing creator pipelines via ComfyUI and RTX‑class PCs rather than only via closed SaaS APIs.
Higgsfield Relight adds studio‑grade lighting control to static images
Relight (Higgsfield): Higgsfield introduced Relight, an image‑editing tool that lets users reposition virtual lights, adjust intensity and color temperature, and control soft‑to‑hard shadows directly on existing photos, turning static renders into relit scenes without a physical studio, as shown in the product teaser relight launch and the creator walkthrough relight deep dive.

• 3D‑aware light placement: The UI exposes controls for light direction, strength and temperature plus six presets, with the model handling geometry so highlights and shadows wrap correctly around faces and objects even as the light source moves relight launch.
• Shadow and mood control: Techhalla’s review notes per‑pixel soft‑to‑hard shadow control and shows before/after sequences where Relight adds drama or flattens contrast while preserving texture and detail, which they frame as "full spatial control" over lighting relight deep dive.
• Creator workflow focus: The same thread stresses that Relight slots into existing pipelines as an edit step—upload an image, mark it for relighting, tweak sliders, export—rather than forcing users into a new generation workflow relight deep dive.
For image and video teams, Relight effectively promotes lighting from a one‑time shoot decision to an editable parameter, which may reduce the need for reshoots when clients change direction late in a project.
Freepik Nano Banana + Kling pipeline recreates a Zelda trailer on $300
Zelda fan trailer (Freepik/Nano Banana Pro): Creator PJ Accetturo detailed how he produced a gritty Legend of Zelda teaser in five days on a ~$300 budget using Freepik’s Nano Banana Pro image model plus Kling 2.6 for animation, mapping out a story arc of "Fear → Ruin → Rage → Capture → Confrontation" and then building each beat with tightly controlled prompts zelda trailer demo.

• Prompted 2×2 scene grids: He uses a 3,000‑character prompt and a 2×2 grid setting in Nano Banana Pro inside Freepik to generate consistent lighting and multiple angles per scene, warning that 3×3 grids "lose too much detail" for cinematic work image generation tip and aspect ratio tip.
• Reference‑driven consistency: Freepik’s UI lets him load a character reference (Zelda), a style reference (a cinematic shot grid), and even a wooden doll photo so Nano Banana Pro keeps anatomy, costumes and framing coherent across dozens of generated frames reference control tip.
• Upscaling and animation: After selecting and cropping stills into a contact sheet, he upscales them inside Freepik with a long, custom upscaler prompt, then feeds key shots into Kling 2.6 with action‑specific prompts (for example, "looking back in fear") plus heavy camera‑shake to get handheld‑feeling motion workflow thread.
• Sound and distribution: The final cut layers an orchestral Zelda cover and hand‑tuned SFX from a collaborator, and Accetturo is sharing the full prompts and process via his newsletter so others can reuse the pipeline workflow newsletter.
This case study shows how off‑the‑shelf image and video models can now support end‑to‑end trailer‑style storytelling, with prompt engineering and selection doing the work that used to require crews and sets.
Genspark AI Image adds "Mark to Edit" region‑aware editing
AI Image (Genspark): Genspark rolled out Mark to Edit, a region‑aware editing mode where users click parts of an image, describe a change in natural language, and have the model apply localized edits while leaving the rest intact, as demonstrated in their launch clip mark to edit demo and product link genspark editor.

• Click‑select, then describe: The demo shows a user lassoing an object, entering a short instruction (for example, changing clothing or colors) and getting an updated render that respects the original composition and lighting mark to edit demo.
• Iterative design loop: Because each edit is scoped to the marked region, creators can iterate on multiple elements in one image—backgrounds, props, characters—without re‑prompting the entire scene, which reduces the risk of global changes undoing earlier work mark to edit demo.
For teams already using image models for layout and ideation, this sort of region‑first control shifts them closer to traditional design tools while retaining the flexibility of prompt‑based generation.
Cursor + Nano Banana + Tripo demo shows 1‑day 3D asset pipeline
Banana 3D workflow (Techhalla): A short demo from Techhalla shows developer "Vibe" coding an app in Cursor, generating reference images with Nano Banana Pro, then converting those into animated 3D FBX assets using Tripo, all within a day, to make a usable interactive environment 3d workflow demo.

• Code → concept art → 3D: The sequence walks through Cursor‑based app development, Nano Banana‑generated concept imagery for a stylized banana, and Tripo’s image‑to‑3D conversion producing a rotating banana model that is then animated and placed into a scene 3d workflow demo.
• End‑state: game‑ready asset: The final clip shows the exported FBX banana moving in a simple environment, with Techhalla arguing that this kind of "stacked" toolchain explains why AI is rapidly taking over content pipelines 3d workflow demo.
While each component here is closed or semi‑open, the workflow illustrates how off‑the‑shelf models are now able to cover concept art, geometry and basic animation in a single loop, reducing the number of specialized tools needed for small teams.
Vector‑to‑plush workflow turns Illustrator art into product shots with Nano Banana
Vector‑to‑plush pipeline (Ror_Fly): Designer Ror_Fly outlined a six‑step workflow that starts from a 2D vector in Adobe Illustrator and ends with plush toy product and lifestyle renders using Nano Banana Pro and Weavy, effectively prototyping a physical line from flat art without a photo shoot vector plush demo.

• From SVG to "material" render: They first export a flat Cartman‑style vector as PNG, then feed it to Nano Banana Pro with a system prompt that "turns 2D vectors into 8K 3D plush," enforcing exact shape and colors while simulating fabric, seams and stuffing physics vector plush demo.
• Angle and scale generation: A second prompt asks Nano Banana Pro to generate nine additional shots from locked materials, proportions and lighting, which are then composited into a contact sheet and combined with a dimension diagram so later lifestyle images preserve consistent toy scale workflow recap.
• Lifestyle expansion: Finally, the contact sheet and scale guide drive another round of prompts that place the plush in six different lifestyle contexts, giving a full product‑page kit (closeups, turnarounds, in‑use scenes) without photographing a real sample workflow recap.
For teams exploring merch or toy concepts, this sort of vector‑to‑plush workflow turns Illustrator skills plus an image model into a de facto 3D and photography department.
🤖 Humanoids and AV reasoning at CES pace
Continuing yesterday’s robotics push with new specifics: Atlas’ production‑grade specs and Hyundai trials, plus NVIDIA’s Alpamayo for explainable driving, and LG’s CLOiD home demo. Excludes Rubin economics (feature).
Atlas humanoid gets production specs and Hyundai factory trial timeline
Atlas humanoid (Boston Dynamics/Hyundai): Boston Dynamics’ latest CES brief put hard numbers on its production Atlas—around 6'2" and 198 lbs, 56 degrees of freedom, a 4‑hour battery with self‑swapping packs for near‑continuous operation, ~110 lb lift (≈66 lb sustained), and Nvidia compute for real‑time autonomy, with the robot constantly re‑evaluating posture, balance and grip on the fly as recapped in Atlas spec summary and joints and hands. Building on the Savannah plant pilot described in Atlas launch, Hyundai now shows Atlas practicing autonomous parts sequencing for roof racks on an active line and talks about ramping to roughly 30,000 units a year by 2028 once safety and quality are validated, positioning Atlas as a real industrial worker rather than a lab demo in Hyundai trial.

• Mechanical and hand design: Continuous 360° joints in limbs, torso and head avoid wire fatigue and enable superhuman twists and recoveries, while three‑finger hands can reconfigure between pinch and wide grips with fingertip tactile sensing feeding a neural controller for force‑aware manipulation of small and large parts joints and hands.
• Training and deployment: Atlas policies are trained via VR teleoperation with humans repeating factory tasks until the robot can perform them autonomously, and Hyundai frames 2028 as the first large‑scale deployment window once these pilots mature Hyundai trial.
The combined spec sheet and early Savannah trial give engineers a clearer sense of Atlas’ real operating envelope and how quickly large fleets could start appearing on automotive lines.
NVIDIA’s Alpamayo AV stack explains its own driving decisions in real time
Alpamayo AV stack (NVIDIA): NVIDIA used CES to unveil Alpamayo, an open autonomous‑driving model family that runs end‑to‑end from raw camera feeds to vehicle actuation while emitting both its chosen maneuver and a natural‑language explanation, pitched as a system that "tells you what action it’s gonna take, and the reasons behind it" in the keynote summarized in Alpamayo segment. Training combines human driving demonstrations, synthetic Cosmos scene data and carefully curated edge‑case scenarios so the model can decompose rare events into familiar patterns and explain choices like "steering to the right to avoid obstruction" before executing, as described in reasoning details and in the FSD‑oriented recap in FSD upgrade thread.

• End‑to‑end plus explanations: The stack is architected to go from multi‑camera perception straight through prediction, planning and control, while a parallel head produces textual rationales that can be surfaced to drivers, validation teams or regulators Alpamayo segment.
• Data mix and edge cases: NVIDIA highlights Alpamayo’s use of synthetic Cosmos scenarios and tele‑operated replays to cover long‑tail hazards, arguing this helps it generalize to novel intersections and complex merges more robustly than rule‑based stacks reasoning details.
• Positioning for FSD vendors: Commentary from AV watchers frames Alpamayo as a potential reasoning core for systems like Tesla FSD, emphasizing its "thinking like a human" narrative and lack of LiDAR reliance in FSD upgrade thread.
The announcement signals a shift toward AV stacks where explainability and human‑readable intent are treated as first‑class outputs alongside trajectories and control commands.
NVIDIA lines up a full physical‑AI stack with major robot partners at CES
Physical AI stack (NVIDIA): NVIDIA framed robotics at CES as reaching a "ChatGPT moment", rolling out a full physical‑AI platform—Cosmos and GR00T foundation models for perception and policy, Isaac Lab‑Arena simulation tools, and Jetson Thor edge compute—while global partners such as Boston Dynamics, Caterpillar, LG, Franka and NEURA showed new humanoids, quadrupeds, industrial arms and heavy machinery all tied into the stack in the keynote montage shown in robotics ecosystem.
• Model + sim + hardware: The platform is pitched as tightly coupled across layers: Cosmos Reason handles multimodal world understanding, GR00T covers general‑purpose robot control, Isaac Lab‑Arena provides large‑scale simulated training grounds, and Jetson Thor brings that execution to edge devices on factory floors and in homes robotics ecosystem.
• Open‑source linkage: NVIDIA and Hugging Face announced plans to integrate open Isaac technologies directly into the LeRobot robotics library, effectively turning popular research code into a front‑end for the same Isaac/Cosmos stack powering the CES demos LeRobot collaboration.
For robotics teams, the message is that NVIDIA wants the combination of its models, sim tools and edge compute to be the default substrate for everything from Atlas‑class humanoids to home helpers and construction rigs.
Reachy Mini home robot sees 4,900% new‑customer spike after CES visibility
Reachy Mini (Pollen Robotics/Hugging Face): Following up on its cameo in NVIDIA’s CES keynote as an accessible desktop robot platform Reachy debut, Pollen Robotics reports a 4,900% day‑over‑day jump in new customers and 1,002 total new buyers in the latest period, attributing the surge to CES exposure and noting that units now ship with ~90‑day lead times as shown in the metrics chart in Reachy adoption data and the earlier spotlight in CES presence.
• Developer stack and guides: NVIDIA and Hugging Face co‑published how‑to material for turning Reachy Mini into a personal AI assistant using Isaac tools and the LeRobot ecosystem, with step‑by‑step demos of voice‑controlled workflows and cloud or local deployment paths in assistant guide and the showcased keynote assistant that paired the robot with a dog on stage in assistant demo.
The adoption spike suggests that a relatively low‑cost, openly programmable platform can attract a broad base of hobbyists and researchers once it’s tied visibly into a major AI stack and developer tooling.
Italian startup Generative Bionics debuts GENE.01 humanoid with "physical AI"
GENE.01 humanoid (Generative Bionics): Italian startup Generative Bionics announced its first humanoid robot, GENE.01, describing it as bringing intelligence into the body by combining generative structural design with on‑board "Physical AI" controllers so its whole‑body motion remains responsive, stable and efficient in dynamic environments, according to the launch highlight in GENE.01 intro.

• Design and control philosophy: The company pitches generative design to optimize the mechanical structure for strength and weight, then layers AI‑based control policies that coordinate joints for smooth locomotion and manipulation rather than relying on rigidly scripted trajectories GENE.01 intro.
While technical details and benchmarks are still sparse, GENE.01 adds another entrant to the growing field of mid‑size humanoid platforms aiming to blend novel mechatronics with learned control for industrial and service roles.
💸 Capital and adoption: xAI and LMArena signal scale
Two sizable funding signals and an assistant footprint move: xAI’s oversubscribed $20B Series E with NVIDIA/Cisco as strategic investors; LMArena’s $150M Series A at $1.7B; Alexa+ expands into web. Hardware feature excluded here.
xAI closes oversubscribed $20B Series E to fund Grok 5 and massive compute
xAI Series E (xAI): xAI has completed an upsized Series E round at $20B, exceeding its earlier $15B target and bringing in strategic investors NVIDIA and Cisco Investments, with Valor, StepStone, Fidelity, QIA, MGX and Baron also participating as detailed in the funding summary. The company says the capital will mainly fund infrastructure and accelerate training and deployment of Grok 5, building on its Colossus I/II clusters that already total >1M H100 GPU equivalents according to the compute breakdown.
Funding scale and positioning: The round follows xAI’s earlier $6.6B raise in 2024 and its all‑stock deal to acquire X at a notional $80B valuation for xAI and $33B for X (net of debt), with The Information previously suggesting an implied valuation around $230B for xAI—numbers cited in the compute breakdown. xAI also highlights that it is running reinforcement learning at “pretraining‑scale compute”, signaling a strategy that leans heavily on large‑scale RL rather than treating it as a small post‑training pass compute breakdown. The combination of strategic GPU partners and a stated focus on Grok 5 training and RL‑heavy pipelines underlines xAI’s intent to compete at the very top tier of general‑purpose model providers rather than staying a niche alt‑stack.
LMArena lands $150M Series A at $1.7B valuation and $30M ARR run rate
LMArena Series A (LMArena): Evaluation platform LMArena has raised $150M in a Series A round that values the company at $1.7B, nearly 3× its May seed valuation, as announced in the funding announcement and echoed by funding recap. The round is led by Felicis and UC Investments (University of California), with a16z, Kleiner Perkins, Lightspeed and others joining, and comes as the team reports an annualized consumption run rate above $30M less than a year after launching its paid evaluations funding announcement.
• Adoption metrics: LMArena says it now serves 5M+ monthly users across 150 countries, generating over 60M conversations per month to evaluate model behavior across text, code, image, video and search workloads, according to the funding announcement. It frames this as evidence that the world wants models that are “measurable, comparable, and accountable”, positioning LMArena as a kind of public benchmark and UX layer for real‑world model quality.
• Use case and spend narrative: The team emphasizes that most existing static benchmarks are saturated, the same concern that led Artificial Analysis to drop MMLU‑Pro, AIME 2025 and LiveCodeBench from its headline index in favor of GDPval‑AA, AA‑Omniscience and CritPT for tracking economically valuable tasks, hallucination and hard physics reasoning, as summarized in the index update. LMArena presents the new capital as fuel to scale engineering, research, ops and community programs rather than as a runway lifeline, and observers like TestingCatalog highlight that it is becoming a default place for both labs and end‑users to compare models funding recap.
Alexa+ comes to Alexa.com, turning Amazon’s assistant into a cross‑platform web agent
Alexa+ web rollout (Amazon): Amazon has brought Alexa+ to the browser at alexa.com, turning the assistant from a voice‑and‑mobile feature into a full web app that can plan, learn, create, shop and find information, as described in the rundown brief and alexaplus analysis. Following up on alexa web, which covered the initial early‑access UI, new coverage stresses agentic task completion and third‑party integrations rather than simple Q&A.
• Agentic integrations: Alexa+ on the web now ties into services like Expedia, Yelp, Angi and Square, allowing it to not only answer questions but also carry out tasks such as booking travel, finding local services and interacting with SMB tooling, according to the alexaplus analysis. The UI surfaces actions like Plan, Learn, Create, Shop, Find prominently, positioning it as a general productivity and commerce agent.
• Adoption context: The Rundown frames this move as Amazon coming directly for ChatGPT’s web turf, arguing that the combination of Alexa’s large install base and a browser‑native experience could shift user behavior away from pure chatbots and toward assistants that sit on top of shopping and home ecosystems rundown brief. There is still no mention of pricing changes or clear GA timelines beyond “early access,” but the pattern is consistent with Amazon’s broader strategy of embedding AI assistance in every access point, from smart speakers to mobile apps and now the desktop browser.
Samsung plans 800M Galaxy AI devices in 2026, doubling AI footprint
Galaxy AI scale‑up (Samsung): Samsung plans to double its AI‑enabled mobile devices to about 800M units in 2026, powered by a mix of Google Gemini and its own Bixby models, according to co‑CEO T.M. Roh’s comments summarized in the samsung ai forecast. The company links this to strong demand for Galaxy AI features like generative images, translation and productivity tools, arguing that “AI adoption will spike rapidly” as more devices ship with these capabilities built in samsung ai forecast.
• Strategic context: The move comes as Samsung faces what it calls an “unprecedented” memory chip shortage and rising smartphone prices, and it explicitly notes that capacity is being prioritized for AI datacenters and DDR5/HBM over legacy DDR4, as highlighted in the same samsung ai forecast. That implies that handset AI features are both a consumer product story and a demand driver for the upstream GPU/memory stack.
• Competitive positioning: Coverage frames this as an escalation in the AI race against Apple and Chinese OEMs, with Samsung betting that branded “Galaxy AI” across its portfolio will help it differentiate even as Meta, Google and others push their own assistant ecosystems into glasses, PCs and apps follow-up comment. The numbers—hundreds of millions of AI‑capable devices—give a concrete sense of how quickly model‑infused consumer hardware is becoming the default rather than a premium tier.
ChatGPT Go is free for a year in India, expanding paid‑tier features
ChatGPT Go India offer (OpenAI): OpenAI is offering 12 months free of its ChatGPT Go tier in India, waiving the usual ₹399/month price for the first year according to the promo screenshot in the india offer. The banner advertises “Do more with smarter AI” and highlights upgraded capabilities like going deeper on harder questions, longer chats and uploads, realistic image generation, more stored context and help with planning and tasks india offer.
Adoption and ARPU angle: The fine print notes that standard pricing resumes at ₹399/month from Jan 6, 2027, with the ability to cancel anytime, which makes this a deliberate land‑grab move in a price‑sensitive but strategically important market rather than a permanent discount india offer. This regional offer lands shortly after OpenAI’s own disclosures that >5% of global ChatGPT traffic is health‑related and that daily usage now spans hundreds of millions of users health usage, suggesting that the company is experimenting with localized pricing and promos to convert heavy users in high‑growth markets into future subscribers once free periods expire.
ElevenLabs voice agents now assist nearly half of Cars24 sales calls
Voice agents at scale (ElevenLabs): ElevenLabs says its Agents product now handles over 3M minutes of customer conversations for used‑car marketplace Cars24, supporting sales in India, the UAE and Australia, as shown in the cars24 case study. The agents guide buyers, handle objections and escalate complex cases to humans, with Cars24 reporting that 45% of sales are now assisted by AI agents and calling costs have dropped by 50% cars24 case study.

Enterprise adoption signal: The case study positions this as a template for high‑volume voice workflows, with AI handling the bulk of routine interactions but integrating tightly with human teams for edge cases. ElevenLabs stresses that the same stack—TTS, ASR and orchestration—can be repurposed to other call‑heavy verticals, and ties the results to its broader claim that real‑time, low‑latency agents are ready for production in sales and support settings rather than remaining a lab demo impact teaser.
📑 Reasoning methods, verification, and visual math
Multiple fresh papers relevant to production agents: meta‑RL that induces exploration between attempts, spectral attention checks to verify math reasoning without training, universal weight subspaces across models, and a visual math pipeline.
Geometry of Reason uses spectral attention signatures to verify math reasoning without training
Geometry of Reason (Devoteam et al.): A new paper shows that treating attention matrices as graphs and extracting simple spectral metrics lets you detect whether an LLM’s math proof is logically valid without any extra training, reaching ~85–96% classification accuracy across seven models and 454 Lean proofs, as detailed in the paper explanation. The authors compute four scores—Fiedler (algebraic connectivity), high‑frequency energy ratio, graph‑signal smoothness, and spectral entropy—and find that a single threshold on one metric often suffices to separate valid from invalid reasoning traces, even when a formal proof checker fails for technical reasons.
• Training‑free verifier: Because the method is fully training‑free and works directly from raw attention patterns, it avoids labeled data and model‑specific graders, suggesting a cheap, generic way to flag low‑quality reasoning or hallucinated steps in math pipelines, as emphasized in the paper explanation.
• Model‑wide signal: The study reports very large effect sizes (Cohen’s d up to 3.3) across Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral models, indicating that spectral structure of attention may be a stable signature of sound reasoning rather than a quirk of any single architecture paper explanation.
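For intuition, here is a minimal numpy sketch of how such spectral scores can be computed from a single attention matrix treated as a weighted token graph; the function name, the stand‑in graph signal and the toy input are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of spectral metrics computed from one attention matrix treated as
# a weighted graph over tokens. Function names and the stand-in graph signal are
# illustrative; this is not the authors' released code.
import numpy as np

def spectral_metrics(attn: np.ndarray, signal=None) -> dict:
    """attn: (T, T) attention weights for one head/layer; signal: optional per-token values."""
    W = 0.5 * (attn + attn.T)               # symmetrize into an undirected weighted graph
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W          # combinatorial graph Laplacian

    evals, evecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    fiedler = float(evals[1])               # algebraic connectivity (second-smallest eigenvalue)

    p = np.clip(evals / max(evals.sum(), 1e-12), 1e-12, None)
    spectral_entropy = float(-(p * np.log(p)).sum())    # entropy of the eigenvalue distribution

    x = signal if signal is not None else attn.sum(axis=0)   # default signal: per-token attention mass
    x = x - x.mean()
    smoothness = float(x @ L @ x / max(x @ x, 1e-12))   # graph-signal smoothness

    energy = (evecs.T @ x) ** 2                          # graph Fourier energy per frequency
    hf_ratio = float(energy[len(energy) // 2:].sum() / max(energy.sum(), 1e-12))

    return {"fiedler": fiedler, "spectral_entropy": spectral_entropy,
            "smoothness": smoothness, "hf_energy_ratio": hf_ratio}

# Toy usage: in the paper's setting, a single threshold on one of these scores
# often separates valid from invalid proof traces.
print(spectral_metrics(np.random.dirichlet(np.ones(64), size=64)))
```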
CogFlow pipeline and MATHCOG dataset tackle diagram‑based math via perception→internalization→reasoning
CogFlow + MATHCOG (Tsinghua and collaborators): The CogFlow framework attacks visual mathematical problem solving by splitting the job into three explicit stages—perception, knowledge internalization, and reasoning—rather than letting a multimodal LLM free‑form its way from pixels to answers, as explained in the paper summary. The authors also introduce MATHCOG, a dataset with 120k+ aligned perception–reasoning annotations, and report higher answer accuracy with fewer "reasoning drift" mistakes where later text ignores earlier diagram facts.
• Grounded perception: CogFlow first extracts shapes and layouts from diagrams, then turns them into structured geometric facts; "Synergistic Visual Rewards" score both fine‑grained details and overall layout during reinforcement learning to strengthen this perception layer paper summary.
• Internalization and gating: A dedicated reward model checks that each reasoning step respects the internalized facts and learns from five common error patterns, while a visual gate forces the model to re‑parse diagrams when perception looks weak, so the final chain of thought stays grounded in what the model actually "saw" paper mention.
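To make the gating idea concrete, a toy structural sketch of a perception → internalization → reasoning loop with a visual gate follows; every function, threshold and fact string is a hypothetical placeholder standing in for the model calls the paper describes.

```python
# Toy structural sketch of a perception -> internalization -> reasoning loop with a
# visual gate. Every function, threshold and fact string below is a hypothetical
# placeholder standing in for the model calls described in the paper.
import random
from dataclasses import dataclass

@dataclass
class Perception:
    facts: list        # structured geometric facts, e.g. "AB = 5", "angle ABC = 90 deg"
    confidence: float  # how confident the perception stage is in its own parse

def perceive(diagram) -> Perception:
    # Stand-in for the diagram parser that extracts shapes and layout.
    return Perception(facts=["AB = 5", "BC = 12", "angle ABC = 90 deg"],
                      confidence=random.uniform(0.5, 1.0))

def reason(facts, question) -> str:
    # Stand-in for the reasoning model, conditioned only on internalized facts.
    return f"Using {facts}, the answer to {question!r} is 13."

def step_reward(answer, facts) -> float:
    # Stand-in for the reward model that checks each step respects the perceived facts.
    return 1.0

def solve(diagram, question, gate=0.8, max_reparses=2) -> str:
    """Visual gate: re-parse the diagram while perception confidence stays below the gate."""
    answer = ""
    for attempt in range(max_reparses + 1):
        p = perceive(diagram)
        if p.confidence >= gate or attempt == max_reparses:
            answer = reason(p.facts, question)
            if step_reward(answer, p.facts) > 0.5:   # grounded enough to return
                return answer
    return answer

print(solve(diagram=None, question="What is AC?"))
```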
Universal Weight Subspace Hypothesis finds shared low‑rank directions across 1,100+ networks
Universal Weight Subspace (Johns Hopkins): The "Universal Weight Subspace Hypothesis" paper analyzes ~1,100 trained models (500 Mistral‑7B LoRAs, 500 Vision Transformers, 50 LLaMA‑8B) and finds that most weight variation in each layer lies in a tiny set of shared principal directions—around 16 principal components per layer can capture the bulk of variance across tasks, as outlined in the paper thread. With a joint low‑rank basis fixed, new tasks can be represented by learning only a handful of coefficients, enabling compact adapters and even merging hundreds of ViT or LoRA experts into a single compressed representation.
• Implications for adapters: Experiments show that constraining LoRA‑style adapters to this shared subspace preserves performance while drastically reducing storage and training cost, and the authors demonstrate merging 500 ViTs into one joint form, according to the paper thread.
• Shared structure: The findings suggest deep networks with similar architectures systematically converge to overlapping spectral subspaces regardless of initialization, data, or domain, which has direct consequences for multi‑task learning, model reuse, and greener training regimes paper thread.
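A small numpy sketch of the core operation, fitting a shared basis across many flattened per‑layer weights and re‑expressing a new model's layer as a handful of coefficients; the synthetic data, dimensions and variable names are illustrative, with only the 16‑component choice taken from the thread.

```python
# Sketch of fitting a shared low-rank weight subspace across many models' weights
# for one layer and projecting a new model onto it. The synthetic data and sizes
# are illustrative; only the choice of 16 components follows the thread.
import numpy as np

rng = np.random.default_rng(0)
n_models, d, true_rank = 500, 8192, 16    # e.g. 500 flattened per-layer weight deltas

# Synthetic data consistent with the hypothesis: weights vary mostly within a
# shared low-rank subspace, plus small idiosyncratic noise.
shared = rng.standard_normal((true_rank, d))
W = rng.standard_normal((n_models, true_rank)) @ shared \
    + 0.01 * rng.standard_normal((n_models, d))

# Fit the shared basis for this layer via SVD over the stacked weights.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 16
basis = Vt[:k]                            # (k, d) shared principal directions

# A new model's layer can now be stored as k coefficients instead of d numbers.
w_new = rng.standard_normal(true_rank) @ shared
coeffs = basis @ w_new                    # (k,) coordinates in the shared subspace
w_hat = coeffs @ basis

rel_err = np.linalg.norm(w_new - w_hat) / np.linalg.norm(w_new)
print(f"{d // k}x smaller per layer, relative reconstruction error {rel_err:.4f}")
```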
LAMER meta‑RL teaches language agents to explore across repeated attempts
LAMER meta‑RL (multiple labs): New work proposes LAMER, a meta‑reinforcement learning framework that trains language agents to treat early attempts as exploration and later ones as exploitation, improving success by 11–19 percentage points over standard RL on Sokoban, Minesweeper, and Webshop according to the paper thread. The method runs multiple attempts on the same task during training, gives later attempts credit for information uncovered earlier via a cross‑episode discount factor, and uses in‑context reflection notes between attempts while keeping model weights fixed at test time.
• Exploration behavior: LAMER explicitly rewards cross‑attempt information gathering so agents stop repeating the same safe but uninformative actions and instead learn a "try, learn, try again" habit, as summarized in the paper thread.
• Generalization: Results show stronger performance on harder or previously unseen problems than RL‑trained baselines, suggesting this meta‑RL pattern could be plugged into production agents that face long‑horizon, retry‑friendly tasks without needing online fine‑tuning.
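A toy sketch of the cross‑episode credit idea: attempts that only gather information still receive discounted credit for rewards earned in later attempts. The return formula and discount values here are illustrative, not the paper's exact objective.

```python
# Toy sketch of cross-episode credit assignment across repeated attempts at one
# task: later attempts' rewards flow back, with an extra discount, to the attempts
# that uncovered the information. Discount values and the return formula are
# illustrative, not the paper's exact math.

def cross_episode_returns(attempt_rewards, gamma_within=0.99, gamma_across=0.9):
    """attempt_rewards: list of per-step reward lists, one list per attempt."""
    # Within-attempt discounted return for each attempt.
    per_attempt = []
    for rewards in attempt_rewards:
        g, ret = 1.0, 0.0
        for r in rewards:
            ret += g * r
            g *= gamma_within
        per_attempt.append(ret)

    # Cross-episode return: each attempt also gets discounted credit for what the
    # following attempts achieve, which rewards useful exploration early on.
    returns = []
    for i in range(len(per_attempt)):
        total, g = 0.0, 1.0
        for j in range(i, len(per_attempt)):
            total += g * per_attempt[j]
            g *= gamma_across
        returns.append(total)
    return returns

# Attempt 1 finds nothing, attempt 2 gathers a hint, attempt 3 solves the task:
print(cross_episode_returns([[0, 0], [0, 0.1], [0, 0, 1.0]]))
# Attempt 1's return is > 0 even though it earned no reward itself.
```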
⚙️ Serving speed and compatibility upgrades
Runtime/serving updates beyond dev tooling: vLLM‑Omni pushes a production‑grade multimodal stack (diffusion upgrades, OpenAI‑compatible endpoints, AMD ROCm), and OpenRouter’s model rankings page halves TBT via lazy loading/code‑split.
Nemotron Speech ASR stack hits sub‑second latency for 127 streams
Nemotron Speech ASR (NVIDIA × Modal × Daily): NVIDIA’s new open‑source transcription model Nemotron Speech ASR is being positioned as a low‑latency building block for voice agents; a reference agent that combines Nemotron Speech, Nemotron 3 Nano 30GB (4‑bit) and a preview of Magpie TTS achieves ~24 ms transcription finalization and under 500 ms end‑to‑end voice‑to‑voice time in demos voice agent stack.

• Concurrent serving benchmark: Modal reports that a single H100 can serve 127 simultaneous WebSocket clients with sub‑second 90th‑percentile delay by tuning the streaming pipeline and batching, as summarized in the modal benchmark.
• Full OSS stack: NVIDIA released weights, training data, training code and inference code for Nemotron Speech, Nemotron 3 Nano, and Magpie, and Daily published an open reference implementation with deployment paths for Modal cloud, DGX Spark and RTX 5090, according to the tech writeup.
• Latency profile: A chart from NVIDIA shows Nemotron Speech’s p90 delay remaining below 1s for 120+ concurrent streams when configured with 160 ms, 560 ms, or 1.12 s segment sizes, while open baselines and proprietary APIs often sit well above that mark.
The combination of fully open artifacts and concrete serving numbers makes this stack one of the clearer examples of production‑grade, low‑latency speech agents that can be hosted outside proprietary platforms.
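As a rough illustration of how such concurrency numbers are measured, here is an asyncio sketch that opens many WebSocket streams, paces audio chunks like live microphones and records per‑stream finalization latency; the endpoint URL, end‑of‑stream marker and message framing are placeholders, not the actual Nemotron or Modal protocol.

```python
# Rough sketch of measuring concurrent-stream latency against a streaming ASR
# WebSocket endpoint: open N streams, pace audio chunks like live microphones, and
# record the gap between the last chunk and the final transcript. The URL,
# end-of-stream marker and message framing are placeholders, not the real protocol.
import asyncio, time
import websockets  # pip install websockets

URL = "wss://example-asr-endpoint/stream"   # hypothetical endpoint
CHUNK_MS, N_STREAMS = 160, 127              # 160 ms audio chunks, 127 concurrent clients

async def one_stream(chunks) -> float:
    async with websockets.connect(URL) as ws:
        for chunk in chunks:
            await ws.send(chunk)
            await asyncio.sleep(CHUNK_MS / 1000)    # pace like a live microphone
        t0 = time.perf_counter()
        await ws.send(b"")                          # placeholder end-of-stream marker
        await ws.recv()                             # wait for the final transcript
        return time.perf_counter() - t0             # finalization latency for this stream

async def main():
    # 20 chunks of 16 kHz, 16-bit mono silence as stand-in audio.
    silence = [b"\x00" * (2 * 16000 * CHUNK_MS // 1000)] * 20
    latencies = sorted(await asyncio.gather(*[one_stream(silence) for _ in range(N_STREAMS)]))
    print(f"p50={latencies[len(latencies) // 2] * 1000:.0f} ms  "
          f"p90={latencies[int(0.9 * len(latencies))] * 1000:.0f} ms")

if __name__ == "__main__":
    asyncio.run(main())
```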
vLLM‑Omni v0.12.0rc1 pushes production‑grade multimodal serving
vLLM‑Omni (vLLM project): The v0.12.0rc1 release shifts focus from "can do multimodal" to "ready for production" by overhauling diffusion serving, adding OpenAI‑compatible Image and Speech endpoints, and shipping official AMD ROCm Docker/CI support—this bundles TeaCache, Cache‑DiT, Sage Attention, Ulysses sequence parallelism and Ring Attention into one optimized stack, as outlined in the release notes.
• Diffusion engine overhaul: The new diffusion path integrates TeaCache and Cache‑DiT plus Ulysses and Ring Attention so image/video models generate faster and more stably on the same hardware, according to the release notes.
• OpenAI‑style endpoints: Native Image and Speech serving now use OpenAI‑compatible routes, which lets teams point existing OpenAI SDKs or clients at vLLM‑Omni with minimal changes while gaining local control, as shown in the release notes.
• AMD ROCm support: An official ROCm Docker image and CI path means the same serving stack can run on AMD GPUs as well as NVIDIA, expanding hardware compatibility for self‑hosted workloads release notes.
The release candidate status signals that interface and perf details may still move, but the direction is clear: vLLM is positioning Omni as a unified, standard‑API serving layer for text, image and speech rather than a research demo.
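In practice, the OpenAI‑compatible routes mean existing clients can simply be repointed at a local deployment. The sketch below assumes a vLLM‑Omni server listening on localhost:8000 with one diffusion model and one TTS model loaded; the base URL and model names are placeholders.

```python
# Sketch of pointing the standard OpenAI Python client at a local vLLM-Omni
# deployment. The base URL, port and model names are placeholders; it assumes the
# server exposes the OpenAI-compatible image and speech routes described in the
# release notes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local server, no real key

# Image generation through the OpenAI-compatible images endpoint.
image = client.images.generate(
    model="local-diffusion-model",   # placeholder name for whatever the server loaded
    prompt="a rack-scale GPU system rendered as a blueprint",
    size="1024x1024",
)
# image.data[0] holds the result (base64 or URL, depending on what the server returns).

# Speech synthesis through the OpenAI-compatible audio endpoint.
speech = client.audio.speech.create(
    model="local-tts-model",         # placeholder name for the TTS model being served
    voice="default",
    input="Serving text, image and speech behind one standard API.",
)
with open("out.wav", "wb") as f:
    f.write(speech.read())
```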
Binary+int8 retrieval serves 40M texts in 200ms on a CPU
Quantized retrieval (Hugging Face / sentence‑transformers): Tom Aarsen shows a search stack that answers queries over 40 million Wikipedia texts in roughly 200 ms on a single CPU server with 8 GB RAM and ~45 GB disk, using a binary index plus int8 rescoring instead of full‑precision vectors retrieval demo.
• Two‑tier embeddings: Documents are embedded once into fp32, then stored as both a binary index (32× smaller) and int8 vectors (4× smaller); live queries are embedded in fp32, binarized for fast approximate search, then rescored against a small int8 candidate set quantization blog.
• Accuracy vs. cost: By loading roughly 4× more candidates via the binary index and then rescoring with int8, the system recovers about 99% of fp32 search performance, compared with ~97% when using binary only, as explained in the rescoring thread.
• Resource profile: The fp32 baseline would need ~180 GB RAM and ~180 GB of embedding storage and run 20–25× slower, while the binary+int8 setup stays within ~6 GB RAM and ~45 GB disk for embeddings rescoring thread.
The demo, which is exposed as a public Space for hands‑on testing demo space, illustrates how aggressive quantization plus two‑stage scoring can keep large retrieval systems cheap enough to run outside GPU clusters.
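A compact sketch of the two‑tier scheme using sentence‑transformers' quantize_embeddings helper and a FAISS binary index, with int8 rescoring against the fp32 query; the embedding model, toy corpus and 4× oversampling factor are illustrative, and the public demo's exact pipeline may differ.

```python
# Compact sketch of two-tier quantized retrieval: a packed-binary FAISS index for
# cheap candidate search, then int8 rescoring against the fp32 query embedding.
# The embedding model, toy corpus and 4x oversampling factor are illustrative.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")  # any sentence embedder works
corpus = ["first wikipedia paragraph ...", "second paragraph ...", "third paragraph ..."]

# Embed once in fp32, then keep only the two quantized copies.
emb_fp32 = model.encode(corpus, normalize_embeddings=True)
emb_bin = quantize_embeddings(emb_fp32, precision="ubinary")   # packed bits, ~32x smaller
emb_int8 = quantize_embeddings(emb_fp32, precision="int8")     # ~4x smaller, used for rescoring

index = faiss.IndexBinaryFlat(emb_fp32.shape[1])               # Hamming search over packed bits
index.add(emb_bin)

def search(query: str, top_k: int = 10, oversample: int = 4):
    q_fp32 = model.encode([query], normalize_embeddings=True)
    q_bin = quantize_embeddings(q_fp32, precision="ubinary")
    _, cand = index.search(q_bin, top_k * oversample)          # fast approximate candidates
    cand = cand[0][cand[0] >= 0]                               # drop padding when the corpus is small
    scores = emb_int8[cand].astype(np.float32) @ q_fp32[0]     # rescore int8 docs vs fp32 query
    order = np.argsort(-scores)[:top_k]
    return [(corpus[i], float(s)) for i, s in zip(cand[order], scores[order])]

print(search("example query", top_k=2))
```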
LTX‑2 gets NVFP4/FP8 checkpoints and day‑0 ComfyUI support
LTX‑2 (Lightricks × NVIDIA × ComfyUI): The LTX‑2 audio‑video foundation model now ships with NVFP4 and FP8 checkpoints optimized for NVIDIA GPUs and is natively supported in ComfyUI from day zero, enabling 4K text‑to‑video with synchronized audio at up to 3× the previous speed and around 60% less VRAM on RTX hardware ltx2 nvfp summary.
• Local 4K generation: LTX‑2 can render native 4K video, up to 50 fps, with motion, dialog, SFX, and music produced together in a single diffusion pass, as shown in the ltx2 perf clip.
• Quantized checkpoints: NVFP4/FP8 variants cut memory and bandwidth requirements so the same GPUs can host longer or higher‑res clips without offloading frames to CPU, according to the ltx2 nvfp summary.
• ComfyUI integration: ComfyUI exposes LTX‑2’s controls (keyframes, canny/depth/pose video‑to‑video, upscaling) as nodes so creators can wire it into existing pipelines instead of writing new serving glue ltx2 comfyui thread.
Together with RTX‑specific optimizations, this moves complex video diffusion from cloud‑only into realistic local‑serve territory for teams with strong GPUs.
NVIDIA outlines major speedups for open AI tools on RTX PCs
Open‑source tools on RTX (NVIDIA): NVIDIA’s CES blog highlights substantial serving‑time gains for popular open tools on RTX PCs—ComfyUI reportedly sees up to 3× faster diffusion workflows, while llama.cpp and Ollama gain up to ~35% faster token generation after NVFP4/FP8 quantization and GPU‑side sampling optimizations nvidia tools thread.
• Token decode gains: For llama.cpp and Ollama, the blog attributes higher tokens‑per‑second to GPU sampling, better memory layouts, and low‑precision formats like NVFP4/FP8, all tuned for consumer RTX cards as detailed in the NVIDIA tools blog.
• Diffusion throughput: ComfyUI’s reworked kernels and scheduler paths, plus integration with new video models like LTX‑2, are credited with up to 3× shorter image/video generation times in node‑based graphs, according to the NVIDIA tools blog.
• Ecosystem framing: NVIDIA notes that developer activity around these PC‑class models has roughly doubled in a year and that the number of developers using local models has grown more than tenfold, making desktop serving performance strategically important nvidia tools thread.
The post positions RTX PCs as a serious serving tier for both experimentation and light production, not just as a training playground.
OpenRouter’s model rankings page halves blocking time with lazy loading
LLM Leaderboard (OpenRouter): OpenRouter re‑implemented its AI Model Rankings page with intersection‑observer‑based lazy loading and code‑splitting, reporting that total blocking time (TBT) is now about 50% lower while users still see the same charts and tables leaderboard update.
• Lazy loading & splits: Only the visible portion of the leaderboard DOM now renders initially; additional rows mount as they scroll into view, and heavy components are split into separate bundles so the main thread stays more responsive leaderboard update.
• Perf metrics: The Lighthouse screenshot shows First Contentful Paint at 0.4s and Largest Contentful Paint at 0.8s, with TBT cut to 340ms—down from roughly double that before the change, per the leaderboard update.
The change is confined to the frontend, but it makes a large leaderboard of multi‑provider usage stats feel more like an app than a static report and reduces client‑side overhead for people monitoring serving trends.
Tencent’s HY‑World 1.5 adds faster inference and a 5B lite model
HY‑World 1.5 (Tencent Hunyuan): Tencent’s new HY‑World 1.5 world‑model release focuses on making large‑scale 3D world generation more deployable, adding open training code, a significantly accelerated inference path and a new 5B‑parameter lite variant that fits on smaller‑VRAM GPUs hy-world update.

• Open training code: The team published fully customizable training recipes so labs can build and train world models tuned to their own simulation tasks rather than treating HY‑World as a fixed black box, as described in the hy-world update.
• Inference speed & VRAM: The accelerated inference path is marketed as fast enough for real‑time or interactive worlds, and the 5B lite model is explicitly aimed at small‑VRAM GPUs where previous versions would not fit or would stall at low frame rates hy-world update.
• Zero‑waitlist app: The associated online app now drops its waitlist, letting users stream generated worlds immediately instead of queueing—this underscores that the serving path is stable enough for broad access hy-world update.
While HY‑World is framed as spatial intelligence rather than a generic LLM, the emphasis on lighter checkpoints and faster inference echoes similar trends in text and vision serving stacks.
Vercel AI SDK 6.0.12 adds programmable image model middleware
AI SDK 6.0.12 (Vercel): The latest AI SDK release introduces image‑model middleware that wraps providers like GPT‑Image‑1, letting developers set global defaults (such as image size) and run programmable transforms on generated images—using a simple wrapImageModel call, as shown in the ai sdk snippet.
• Default parameters: The example sets a default 1024x1024 size via transformParams, so any generateImage call without an explicit size gets normalized without touching call sites ai sdk snippet.
• Post‑processing hook: The middleware is specified with specificationVersion: 'v3', signaling support for standardized hooks that could resize, filter, or run guardrails on images before they leave the server, all inside AI SDK’s pipeline ai sdk snippet.
This sits between app code and vendor APIs, making it easier to keep image serving consistent when swapping or mixing models without rewriting every call.