Google’s BATS lifts BrowseComp accuracy to 24.6% – cuts agent cost 31%

Sun, Dec 14, 2025

Executive Summary

Google’s new BATS framework is the first agent paper this week that feels directly aimed at your cloud bill. The lightweight Budget Tracker plugin alone hits ReAct‑level accuracy on web tasks using 10 instead of 100 tool calls and trims overall cost by about 31%, simply by exposing live “query budget remaining” counters inside the agent’s reasoning loop.
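To make the "budget remaining" idea concrete, here is a minimal sketch of a tool-call counter whose status line is appended to the prompt before every reasoning step. It is not the paper's implementation: `BudgetTracker`, `status_line`, and `build_prompt` are hypothetical names, and the wording of the injected line is an assumption.

```python
from dataclasses import dataclass


@dataclass
class BudgetTracker:
    """Hypothetical helper: counts tool calls against a fixed cap."""

    max_tool_calls: int
    used: int = 0

    @property
    def remaining(self) -> int:
        # Never report a negative budget.
        return max(self.max_tool_calls - self.used, 0)

    def record_call(self) -> None:
        # Refuse to spend past the cap; otherwise count the call.
        if self.remaining == 0:
            raise RuntimeError("tool-call budget exhausted")
        self.used += 1

    def status_line(self) -> str:
        # Assumed wording: the text the agent sees at each step.
        return (
            f"[budget] tool calls used: {self.used}/{self.max_tool_calls}, "
            f"remaining: {self.remaining}"
        )


def build_prompt(question: str, history: list[str], tracker: BudgetTracker) -> str:
    # Append the live budget counter after the reasoning history so the
    # model reads it immediately before choosing its next action.
    return "\n".join([f"Question: {question}", *history, tracker.status_line()])


if __name__ == "__main__":
    tracker = BudgetTracker(max_tool_calls=10)
    tracker.record_call()  # pretend one search call was made
    print(build_prompt("Who founded the lab?", ["Thought: search first."], tracker))
```

The design choice worth noting is that the counter lives in the prompt, not only in the orchestration code: a hard cap enforced outside the model stops overspending, but only a visible counter lets the model plan around what is left.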

Full BATS orchestration pushes harder on quality. On BrowseComp, a Gemini‑2.5‑Pro agent with BATS scores 24.6% vs ReAct’s 12.6% under the same 100‑tool cap; BrowseComp‑ZH jumps to 46.0% vs 31.5%, and HLE‑Search to 27.0% vs 20.5%. Planning and self‑verification both become budget‑aware, and the paper introduces a unified cost metric that blends token spend and tool‑call prices so you can reason in dollars, not abstract “steps.”
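The summary above does not give the paper's exact formula for the unified cost metric, so the following is only one plausible reading: a linear blend of token spend and per-call tool prices, expressed in dollars. The `Prices` class, the `unified_cost` function, and every price in the example are illustrative assumptions, not figures from the paper or from any vendor's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical price sheet; all values are placeholders."""

    usd_per_1k_input_tokens: float
    usd_per_1k_output_tokens: float
    usd_per_tool_call: float


def unified_cost(
    input_tokens: int, output_tokens: int, tool_calls: int, prices: Prices
) -> float:
    # Token spend, priced separately for input and output tokens.
    token_cost = (
        input_tokens / 1000 * prices.usd_per_1k_input_tokens
        + output_tokens / 1000 * prices.usd_per_1k_output_tokens
    )
    # Tool spend, assuming a flat price per call.
    tool_cost = tool_calls * prices.usd_per_tool_call
    return token_cost + tool_cost


if __name__ == "__main__":
    placeholder = Prices(
        usd_per_1k_input_tokens=0.002,
        usd_per_1k_output_tokens=0.01,
        usd_per_tool_call=0.005,
    )
    # Compare a 100-call run with a 10-call run on the same token usage.
    for calls in (100, 10):
        print(calls, round(unified_cost(200_000, 20_000, calls, placeholder), 4))
```

Putting both cost sources in one number is what lets two agent configurations be compared directly, even when one spends more on reasoning tokens and the other on tool calls.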

DAIR.AI is already teaching BATS in its agent courses, which is usually a sign a pattern is graduating from research toy to production norm. Paired with Artificial Analysis’ token‑usage charts showing GPT‑5.2‑xhigh burning nearly 2× Sonnet’s tokens for similar work, the direction of travel is clear: agents that ignore budgets are going to feel as dated as models that ignore context windows. Time to make “budget‑aware by design” part of your standard agent spec.

Feature Spotlight

Feature: Budget‑aware agent scaling (BATS)

Google’s BATS makes agents budget‑aware, roughly doubling BrowseComp accuracy vs ReAct under equal budgets (24.6% vs 12.6%) and reaching ReAct‑level accuracy with 10× fewer tool calls. The result is clear design guidance for reliable, cheaper web agents.

Coverage across accounts converges on Google’s BATS/Budget Tracker paper, which shows that making agents explicitly aware of their tool‑call budgets lifts accuracy and cuts cost. The results are mostly for web agents, reported as concrete deltas vs ReAct.

Table of Contents

🎯 Feature: Budget‑aware agent scaling (BATS)

Google’s BATS makes web agents budget‑aware, slashing tool use and cost


📊 Frontier scoreboards and token economics

AA‑Omniscience Index finds Gemini and Opus more reliable than GPT‑5.2

Artificial Analysis Index ties Gemini 3 and GPT‑5.2 at the top

AA token-usage chart exposes big efficiency gaps between frontier models

CritPt benchmark shows GPT‑5.1 and GPT‑5.2 at 0% on this suite


🧰 Skills, agent stacks and coding workflows

Codex CLI gets experimental “skills” with AGENTS.md and SKILL.md patterns

Azure AI samples repo connects local Ollama and LangChain.js to Azure serverless RAG

LangChain community ships AI Travel Agent template with six tools and cloud deploys

Synapse Workflows shows a LangGraph multi‑agent stack for search, productivity and data

Cursor vs Droid comparison shows how IDE instructions quietly change Codex behavior

JustHTML becomes a flagship case study for serious agent‑assisted coding

Unix-style composable tools emerge as a pattern for coding agent stacks

Warp terminal lets agents pull live command output via @‑references

Yutori’s Scouts opens up a push‑based browser research agent built on SoTA web use

FAANG engineer outlines a realistic AI‑infused coding workflow from design doc to prod


🧪 Agentic coding: from papers to production‑grade code

DeepCode agent surpasses human PhDs and commercial tools at paper-to-code

Chain of Unit-Physics bakes physics tests into multi-agent code generation


🧠 Long‑context and deterministic verification advances

BEAVER offers deterministic safety bounds for LLM rule‑following

RoPE++ keeps imaginary attention to halve KV cache with 64k+ context


🏗️ AI infra cycle, memory bottlenecks and export risk

AI-driven semiconductor “giga cycle” pushes chips toward ~$1T and HBM toward $100B

US bill seeks 30‑month halt on Nvidia H200 export licences to China

Broadcom’s $11.1B quarter underscores how AI pays for the “plumbing”

Bond report projects “AI era” to reach tens of billions of edge devices

Starcloud and Google explore space-based data centers for AI compute


🗣️ Builders’ stacks, long‑context warnings and UX pain points

Builders lean into multi-model stacks with Opus 4.5, GPT‑5.2 and Gemini

Amp debates whether 1M‑token threads help more than they hurt

Claude’s chat compaction and file UX spark pushback from knowledge workers

GPT‑5.2 Pro’s Extended and xhigh modes trade latency for reliability


💼 Market maps and enterprise platform signals

Updated open‑model tier list crowns DeepSeek, Qwen, Kimi as frontier labs

Google’s Antigravity IDE quietly ships as a free, agent‑first editor

Microsoft Copilot teases 2025 Flight Log and Smart Plus GPT‑5.2 mode

Statista map shows Beijing and Silicon Valley dominate AI share of VC

US computing patents classified as G06 spike after ChatGPT era


🎬 Creative pipelines: Gemini Flash, NB Pro grids, and Retake

Gemini 3 Flash spins an entire animated “video” from a single HTML prompt

3×3 grids evolve into multi-model pipelines for AI “cinematography”

LTX Retake turns 20-second clips into new shots with one prompt

DesignArena’s new Lotus and Cactus image models appear to be OpenAI GPT‑4

Nano Banana Pro gets a "cloud pareidolia" recipe for pseudo‑photography

Nano Banana Pro is being used for full literary spreads like A Christmas Carol

Horror and sci‑fi merchants are prototyping detailed pillow designs with NB Pro


🤖 Embodied: open robots ship, lab rigs get game‑ready

Reachy Mini robots start landing on desks as open-source AI hardware

Doom-playing rat rig evolves into a richer embodied learning setup

Gemini Live drives Stanford’s Puppers robot dog in new demo


🎙️ Realtime voice: Translate and site voice agents

Gemini speech-to-speech in Google Translate headed to devs next year

Solo consultant shows ElevenLabs voice agent wired into RAG and n8n CRM

Builders report noticeable Gemini Flash native audio quality jump

Community project brings Gemini Native Audio translation into the web stack
