NVIDIA NVARC 4B slashes ARC‑AGI‑2 costs – $0.20 per task

Stay in the loop

Free daily newsletter & Telegram daily report

Executive Summary

Today’s action shifted to ARC‑AGI‑2, where we now have both a fully documented SOTA stack and a shockingly cheap small‑model entrant. ARC Prize has published Poetiq’s Gemini 3 Pro refinement pipeline, locking in ~54% accuracy at about $30.57 per puzzle with open code, while leaked charts show an internal GPT‑5.2 point hovering near the 50% band—nicely in line with the 5.2 rumors we covered yesterday. At the other end of the curve, NVIDIA’s NVARC team is touting a custom 4B model at roughly 27.6% for $0.20 per task, reshaping the score‑per‑dollar frontier.

The NVARC recipe leans on synthetic ARC‑style curricula, heavy test‑time training, and NeMo tooling so a tiny model can adapt per instance instead of hauling around a giant frozen prior. In practical terms, you can copy Poetiq’s high‑budget Gemini refinement stack for maximum score, or borrow NVARC’s “adaptive reasoner” pattern to bolt cheap specialists onto your existing systems; most teams will end up doing both, depending on workload.

Practitioners are also pushing back on treating ARC‑AGI‑2 as a universal quality stamp: some mid‑pack models reportedly feel better in daily use than leaderboard stars. Treat this benchmark as a sharp instrument for reasoning and cost curves, then validate with your own tools, latency budgets, and live agent runs before you anoint a new house model.

Feature: ARC‑AGI‑2 surge—SOTA and cheap solvers

ARC‑AGI‑2 escalates: Poetiq hits a verified 54% at ~$31/task and NVIDIA’s 4B model scores ~27.6% at $0.20—showing both ceiling and cost‑floor shifts in general reasoning.

Many tweets center on ARC‑AGI‑2: a verified SOTA above 50% and a ultra‑cheap small model entrant. This is the clearest cross‑account storyline today for engineers tracking general reasoning progress and cost curves.

Jump to Feature: ARC‑AGI‑2 surge—SOTA and cheap solvers topics

📈 Feature: ARC‑AGI‑2 surge—SOTA and cheap solvers

Poetiq’s Gemini 3 Pro refinement locked in as reproducible ARC‑AGI‑2 SOTA

ARC Prize has now published the full setup for Poetiq’s Gemini 3 Pro–based refinement stack, confirming ~54% ARC‑AGI‑2 accuracy at a reported $30.57 per problem and making the solver reproducible via open code. (Poetiq SOTA, arcprize update, open repo) On current public charts this “Gemini 3 Pro (Ref.)” point anchors the top‑right of the cost–score frontier, well above baseline Gemini and most competing models but at a notably higher per‑task price. leaderboard thread

For builders, the new detail is practical: you can now study and run the exact refinement pipeline, not just the headline score. That means you can inspect prompt structure, search strategy, and tool usage, then decide whether to (a) copy the approach to your own Gemini‑class models, (b) adapt it to other providers, or (c) treat it as a reference design for high‑budget reasoning systems. The downside is obvious—~$30 per puzzle is too expensive for most production use—but as with AlphaGo‑era systems, SOTA stacks like this often get distilled or approximated into cheaper variants later. If you care about general reasoning research, this is the blueprint to read before you try to beat it.

NVIDIA NVARC 4B slashes ARC‑AGI‑2 costs – $0.20 per task

Executive Summary

Top links today

Feature: ARC‑AGI‑2 surge—SOTA and cheap solvers

Table of Contents

📈 Feature: ARC‑AGI‑2 surge—SOTA and cheap solvers

Poetiq’s Gemini 3 Pro refinement locked in as reproducible ARC‑AGI‑2 SOTA

NVIDIA’s 4B NVARC model hits ~27.6% ARC‑AGI‑2 at $0.20 per task

Leaked ARC‑AGI‑2 plots show a GPT‑5.2 point near 50% accuracy

Builders push back on using ARC‑AGI‑2 as a single quality proxy

🛰️ Frontier model watch: Gemini 3 Flash, Grok 4.20, Nano Banana 2

Gemini 3 Flash quietly appears on LM Arena for head‑to‑head testing

Elon targets Grok 4.20 release in 3–4 weeks, Grok 5 likely slips

Leak points to Nano Banana 2 Flash with near‑Pro quality at lower cost

Rnj‑1 touted as first truly open 8B in the GPT‑4o tier

🧩 Agent ops: MCP skills, LangGraph flows, PM integration

Claude Skills formalize reusable tools as markdown plus SDK support

LangChain shares concrete evaluation patterns for Deep Agents and Terminal Bench

LangGraph’s Energy Buddy shows hybrid agent design with OCR and ReAct

LlamaAgents ship configurable invoice and contract agents as LlamaCloud templates

Open Deep Research publishes detailed LangGraph blueprint for research agents

Indie builder prototypes a beads‑based memory system for coding agents

Speechmatics positions real‑time diarization as a core primitive for voice agents

Warp adds AI model profiles with per‑profile permissions and routing

(omitted)

(omitted)

⚙️ TPUs and accelerator economics: Ironwood, pricing gaps

Analyst: TPU v6/v7 undercut Nvidia GB200/300 by 20–50% per useful FLOP

Anthropic’s reported $52B TPUv7 buy points to sovereign TPU data centers

Google details Ironwood TPU pods at 42.5 FP8 ExaFLOPS

🧠 HBM is the bottleneck: memory‑centric AI systems

HBM capex shift drives sharp DDR price hikes for AI

LLM inference shown to be HBM‑bound rather than FLOP‑bound

Micron to spend ~$9.6B on new HBM fab in Hiroshima

🛡️ Safety & security: medical harm, jailbreak defenses, JS vulns

NOHARM benchmark shows severe medical harms in ~22% of LLM consults

ARENAJS finds LLMs brittle at real-world JavaScript vulnerability detection

Three-layer defense stack cuts jailbreak attacks to near zero in tests

🏗️ Power and scale: DC buildouts and grid pressure

US AI data center load seen hitting ~106 GW by 2035 as 1–5 GW campuses proliferate

Elon Musk pitches 100 GW per year of AI compute from solar‑powered satellites

🎬 Creative stacks: NB Pro grids, Kling editing, one‑shot ads

Nano Banana Pro 4-step grid workflow nails cinematic character shots

InVideo’s Money Shot tool one-shots spec ads and launches $25k contest

Kling O1 becomes a first-class ComfyUI partner for reference editing

NB Pro shows deep game-world knowledge with Zelda creature renders

💼 Enterprise & market moves: platform controls and usage

Anthropic moves to own the dev tooling stack with Bun buy and more M&A

OpenAI disables ChatGPT in‑chat app suggestions after “ad” backlash

Anthropic worker study: 69% hide AI use from employers despite daily reliance

Australia unveils $460M National AI Plan spanning infra, skills and safety

New MAU table shows Gemini closing on ChatGPT while Copilot usage stays flat

Baidu cuts legacy units after 11.23B yuan loss while protecting AI talent

US community colleges race to teach AI as basic job safety skill

Reddit struggles as AI-generated “slop” floods top subreddits with no reliable filters

🧠 Agent learning by doing: world models, curiosity, co‑improvement

CuES turns tool environments into self‑generated RL curricula

WMAct teaches LLM agents compact world models via multi‑turn interaction

Meta argues AI–human co‑improvement beats fully autonomous self‑improvement

Agentic LLMs survey maps reasoning–acting–interacting and data-from-use

🧪 Training efficiency: Miles + FSDP2 backend

Miles adds FSDP2 backend aligned with Megatron and ready for Qwen3-4B

💹 Finance agents: orchestration and semantic market links

Agentic multi‑agent trading framework beats S&P 500 and Bitcoin baselines

Semantic Trading agent finds profitable links across prediction markets

🤖 Robotics in the field: construction, weeding, and mechs

GITAI robots autonomously assemble 5‑meter tower for off‑world habitats

Tsubame’s 4.5 m ARCHAX mech demonstrates piloted humanoid robot for heavy work

LaserWeeder G2 shows chemical‑free, vision‑guided weed removal in the field

🗣️ Realtime voice agents: UX and pipeline hints

Gemini Live web adds ‘share screen for live translation’ entry point

Grok voice mode emerges as fastest, most natural option for tasking

On this page