Google Titans hold 70% accuracy at 10M tokens – MIRAS rewires test-time memory

Stay in the loop

Free daily newsletter & Telegram daily report

Executive Summary

Google used NeurIPS to quietly roll out a very different kind of frontier model: Titans, a “post‑Transformer” stack with real test‑time memory, plus the MIRAS theory that underpins it. In long‑context tests, Titans stays near 90% accuracy across most of the BABILong benchmark and only falls to about 70% at 10M tokens, while GPT‑4‑class baselines and Mamba‑style RNNs slide below 50% much earlier. On 2M+ token Needle‑in‑a‑Haystack tasks, Titans reportedly beats GPT‑4 despite using fewer parameters and without paying full quadratic attention costs.

Under the hood, Titans keeps attention for short‑term context but adds a deep MLP “contextual memory” that updates during inference based on a surprise signal from input gradients, alongside the usual persistent memory in the weights. MIRAS treats that memory as something you optimize at test time under a customizable loss, and it shows that common forget gates are mathematically equivalent to retention regularization. Practically, you keep training parallelizable—linear ops within chunks, non‑linear memory updates across chunks—while letting the model decide what to retain as it reads.

After last week’s Gemini 3 Deep Think push on parallel reasoning, this is Google’s other bet: models that learn what to store and recall directly, instead of leaning on ever‑bigger windows and brittle RAG chains.

Feature: Google’s Titans + MIRAS bring true test‑time memory

Google’s Titans + MIRAS introduce gradient‑driven long‑term neural memory at inference, sustaining reasoning to ~10M tokens and beating prior baselines on extreme long‑context tasks—potentially reshaping model design beyond pure Transformers.

Cross‑account NeurIPS coverage of Google’s new Titans architectures and MIRAS theory: deep MLP memory updated at inference, persistent/contextual memory split, and extreme long‑context recall. This is today’s most referenced technical story.

Jump to Feature: Google’s Titans + MIRAS bring true test‑time memory topics

🧠 Feature: Google’s Titans + MIRAS bring true test‑time memory

Google debuts Titans architectures and MIRAS theory for test‑time neural memory

Google Research introduced Titans, a family of architectures with a learned long‑term memory module, together with MIRAS, a theory framework for test‑time memorization and online memory optimization. Titans explainer thread The Titans block keeps standard attention for short‑term context but adds a deep MLP "contextual memory" that is updated during inference based on a surprise signal (input gradients), alongside a separate persistent memory in the weights, so models can decide what to retain as they read rather than baking everything into static parameters. Google blog post MIRAS frames memory as minimizing a customizable loss over this module at inference, showing that deep memories trained with L1 or Huber objectives are more stable and expressive than vector/matrix compressions like Mamba and that common forget gates are mathematically equivalent to retention regularization (weight decay). MIRAS Arxiv paper For engineers this means you can keep training parallelizable—linear ops within chunks, non‑linear memory updates across chunks—while gaining adaptive, long‑horizon behavior; for theorists it offers a unified lens on attention, RNNs and online optimization. Titans Arxiv paper

Google Titans hold 70% accuracy at 10M tokens – MIRAS rewires test-time memory

Executive Summary

Top links today

Feature: Google’s Titans + MIRAS bring true test‑time memory

Table of Contents

🧠 Feature: Google’s Titans + MIRAS bring true test‑time memory

Google debuts Titans architectures and MIRAS theory for test‑time neural memory

Titans sustain strong accuracy out to 10M tokens and beat GPT‑4 on long‑context tasks

Community frames Titans + MIRAS as a serious “post‑Transformer” direction

📈 Benchmarks: ARC‑AGI‑2 record, MRCR regressions, ACE consumer test

Poetiq’s ARC‑AGI‑2 solver verified at 54.4% and fully open‑sourced

AI Consumer Index: GPT‑5 and o3 lead, but no model clears 60%

Context Arena’s MRCR shows surprising long‑context regression for minimax‑m2

Cortex‑AGI v2.1: DeepSeek v3.2 tops open models, Gemini 3.0 Pro leads overall

🛠️ Coding agents and developer tooling in practice

Browser Use launches ‘Skills’ to turn recorded web flows into APIs

LangChain’s open Programmatic Tool Calling agent targets 85–98% token cuts

CopilotKit’s useCopilotAction connects natural language to safe UI actions

Oracle v0.5.2 improves browser clicks and attachment handling for Codex flows

Warp terminal adds built‑in web search for its agent

Clawdis Mac app adds menu-bar Claude agent with Voice Wake and MCP tools

LangSmith schedules webinar on observing and evaluating deep agents

LangChain community ships bank-statement and historical-timeline research agents

🏗️ Compute economics: US/China gap and memory squeeze

Bloomberg: US AI capex projected at >$300B in 2026 vs China’s ~$30B

AI memory demand set to push 2026 laptop and desktop prices sharply higher

Epoch: US back to ~70% of tracked AI compute, China near 30%

Jensen Huang: AI race now hinges on power, land and China’s faster build cycles

WaPo: Google’s latest model tops multi‑task tests as OpenAI’s lead narrows

Grok 4 training used ~750M liters of water, less than a square mile of farmland

📚 Reasoning & test‑time learning: new theory and recipes

Natural Language Actor‑Critic replaces scalar rewards with text critiques for agent training

Study maps when to spend test‑time compute on more samples vs longer chains

Algorithmic Thinking Theory formalizes how to optimally reuse partial LLM solutions

Nexus introduces higher‑order attention blocks that lift math and logic accuracy

Production repetition study: beam search and DPO fix most LLM looping failures

Semantic Soft Bootstrapping boosts math reasoning ~10 points without RL

STELLA uses semantic time‑series summaries to make LLMs better forecasters

🎬 Creative stacks: product‑true video ads and image control

InVideo’s Money Shot engine targets product-exact AI video ads

Nano Banana Pro evolves into a promptable control stack for detailed visuals

Grok‑authored Claude Skill helps Opus 4.5 crank out Apple‑style infographics

Higgsfield’s ClipCut sells unlimited match‑cut fashion reels on image models

🤖 Embodied agents: SIMA 2, video‑to‑robot motion, and demos

SIMA 2 turns Gemini into a generalist keyboard‑and‑mouse game agent

GenMimic maps human motion in video to stable humanoid robot control

EngineAI’s T800 kick demo sparks debate over humanoid power and staging

🗣️ Realtime voice and TTS updates

Gemini Live prepares web “share screen for live translation”

Qwen3‑TTS gets an interactive demo on ModelScope

💼 Enterprise & GTM: news licensing and assistant stickiness

Meta AI signs paid news licensing deals for real‑time answers with links

Microsoft readies Copilot Smart Mode upgrade to GPT‑5.1 with Reminders and Projects

Anthropic’s AI Interviewer finds 86% time savings but 55% job anxiety

ChatGPT hits ~800M monthly users as Gemini nears 650M, but paid growth stalls

⚙️ Runtime speedups: diffusion cache and app migration notes

SGLang‑Diffusion adds Cache‑DiT for up to 165% faster DiT inference

ComfyUI-Manager becomes built‑in, with legacy backup and new UI toggle

🛡️ Safety: deceptive agents, insecure vibe‑coding, and lab grades

FLI’s AI Safety Index gives top labs only C+ and fails all on x‑risk

Tool-using LLM agents caught fabricating files and hiding failures

Andrew Yang warns 44% of US jobs exposed to AI automation

Psychometric “therapy” jailbreak elicits trauma-like self‑stories from chatbots

Study of 306 teams finds production AI agents kept on a short leash

Vibe-coding still ships insecure code despite SUSVIBES warnings

OpenAI denies live ChatGPT ad tests but leaves door open cautiously

Stop AI expels co‑founder after assault and warns labs of possible threat

On this page