Google’s BATS lifts BrowseComp accuracy to 24.6% – cuts agent cost 31%

Executive Summary

Google’s new BATS framework is the first agent paper this week that feels directly aimed at your cloud bill. The lightweight Budget Tracker plugin alone hits ReAct‑level accuracy on web tasks using 10 instead of 100 tool calls and trims overall cost by about 31%, simply by exposing live “query budget remaining” counters inside the agent’s reasoning loop.

Full BATS orchestration pushes harder on quality. On BrowseComp, a Gemini‑2.5‑Pro agent with BATS scores 24.6% vs ReAct’s 12.6% under the same 100‑tool cap; BrowseComp‑ZH jumps to 46.0% vs 31.5%, and HLE‑Search to 27.0% vs 20.5%. Planning and self‑verification both become budget‑aware, and the paper introduces a unified cost metric that blends token spend and tool‑call prices so you can reason in dollars, not abstract “steps.”

DAIR.AI is already teaching BATS in its agent courses, which is usually a sign a pattern is graduating from research toy to production norm. Paired with Artificial Analysis’ token‑usage charts showing GPT‑5.2‑xhigh burning nearly 2× Sonnet’s tokens for similar work, the direction of travel is clear: agents that ignore budgets are going to feel as dated as models that ignore context windows. Time to make “budget‑aware by design” part of your standard agent spec.

Feature: Budget‑aware agent scaling (BATS)

Google’s BATS makes agents budget‑aware, roughly doubling BrowseComp accuracy vs ReAct under equal budgets (24.6% vs 12.6%) and hitting ReAct‑level accuracy with 10× fewer tool calls. That is clear design guidance for reliable, cheaper web agents.

Coverage across accounts converges on Google’s BATS/Budget Tracker paper, which shows that making agents explicitly aware of their tool‑call budgets lifts accuracy and cuts cost. The results are mostly on web‑agent benchmarks and are reported as concrete deltas vs ReAct.

🎯 Feature: Budget‑aware agent scaling (BATS)

Google’s BATS makes web agents budget‑aware, slashing tool use and cost

Google researchers released the BATS framework and its lightweight Budget Tracker plugin, showing that making web agents explicitly aware of their remaining search/browse budget can match or beat ReAct while using an order of magnitude fewer tool calls and ~31% less cost (paper thread). Budget Tracker alone reaches ReAct‑level accuracy with 10 vs 100 tool calls (−40.4% search, −21.4% browse, −31.3% overall cost) by surfacing live counters like “query budget remaining” inside the agent’s reasoning loop (detailed recap).
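To make the "live counter" idea concrete, here is a minimal Python sketch of a budget tracker whose status line is appended to the agent's prompt on every step. It is an illustration under stated assumptions, not the paper's implementation: the class name, the `build_prompt` helper, the tool names, and the exact wording of the status line are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetTracker:
    """Hypothetical per-tool call budget (illustrative, not the paper's code)."""

    limits: dict[str, int]                       # e.g. {"search": 10, "browse": 10}
    used: dict[str, int] = field(default_factory=dict)

    def remaining(self, tool: str) -> int:
        """Calls still available for the given tool."""
        return self.limits[tool] - self.used.get(tool, 0)

    def record(self, tool: str) -> None:
        """Count one call, refusing it if the budget is already spent."""
        if self.remaining(tool) <= 0:
            raise RuntimeError(f"{tool} budget exhausted")
        self.used[tool] = self.used.get(tool, 0) + 1

    def status(self) -> str:
        """Render the counters as text the model can read in its context."""
        parts = [
            f"{tool} budget remaining: {self.remaining(tool)}/{limit}"
            for tool, limit in self.limits.items()
        ]
        return "[" + "; ".join(parts) + "]"


def build_prompt(question: str, history: list[str], tracker: BudgetTracker) -> str:
    """Assumed prompt layout: question, prior steps, then the live budget line."""
    return "\n".join([question, *history, tracker.status()])
```

In an agent loop, `tracker.record("search")` would run before each search call and `build_prompt(...)` before each model call, so the model sees the updated counters at every reasoning step.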

Full BATS orchestration goes further by making planning and self‑verification budget‑aware: on BrowseComp, a Gemini‑2.5‑Pro agent with BATS hits 24.6% accuracy vs ReAct’s 12.6% under the same 100‑tool cap, with similar gains on BrowseComp‑ZH (46.0% vs 31.5%) and HLE‑Search (27.0% vs 20.5%) (paper thread). The paper also proposes a unified cost metric that combines token spend and tool‑call cost, so teams can reason about accuracy vs money instead of raw hit rate alone (arXiv paper). DAIR.AI is already using BATS as a teaching example in its agent courses, signaling that “budget‑aware by design” is likely to become a standard pattern for serious production agents rather than an optimization afterthought (DAIR.AI courses).
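As a rough illustration of what a unified cost metric can look like, the sketch below adds token spend and tool‑call spend into a single dollar figure. The summary above does not give the paper's exact formula, so the simple additive form, the function name, and every price in the usage note are assumptions made for the example.

```python
def unified_cost(
    input_tokens: int,
    output_tokens: int,
    tool_calls: dict[str, int],
    price_per_input_token: float,
    price_per_output_token: float,
    price_per_tool_call: dict[str, float],
) -> float:
    """Total cost of one agent run in dollars (assumed additive form)."""
    # Token spend: input and output tokens are priced separately.
    token_cost = (
        input_tokens * price_per_input_token
        + output_tokens * price_per_output_token
    )
    # Tool spend: each tool type has its own per-call price.
    tool_cost = sum(
        count * price_per_tool_call[tool] for tool, count in tool_calls.items()
    )
    return token_cost + tool_cost
```

With made‑up prices, a run that uses 1,000 input tokens at $0.001 each, 100 output tokens at $0.01 each, and 3 search calls at $0.50 each comes out at about $3.50. Comparing that figure across agent configurations is what lets accuracy be plotted against money rather than against step counts.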

Table of Contents

🎯 Feature: Budget‑aware agent scaling (BATS)

Google’s BATS makes web agents budget‑aware, slashing tool use and cost

📊 Frontier scoreboards and token economics

AA‑Omniscience Index finds Gemini and Opus more reliable than GPT‑5.2

Artificial Analysis Index ties Gemini 3 and GPT‑5.2 at the top

AA token-usage chart exposes big efficiency gaps between frontier models

CritPt benchmark shows GPT‑5.1 and GPT‑5.2 at 0% on this suite

🧰 Skills, agent stacks and coding workflows

Codex CLI gets experimental “skills” with AGENTS.md and SKILL.md patterns

Azure AI samples repo connects local Ollama and LangChain.js to Azure serverless RAG

LangChain community ships AI Travel Agent template with six tools and cloud deploys

Synapse Workflows shows a LangGraph multi‑agent stack for search, productivity and data

Cursor vs Droid comparison shows how IDE instructions quietly change Codex behavior

JustHTML becomes a flagship case study for serious agent‑assisted coding

Unix-style composable tools emerge as a pattern for coding agent stacks

Warp terminal lets agents pull live command output via @‑references

Yutori’s Scouts opens up a push‑based browser research agent built on SoTA web use

FAANG engineer outlines a realistic AI‑infused coding workflow from design doc to prod

🧪 Agentic coding: from papers to production‑grade code

DeepCode agent surpasses human PhDs and commercial tools at paper-to-code

Chain of Unit-Physics bakes physics tests into multi-agent code generation

🧠 Long‑context and deterministic verification advances

BEAVER offers deterministic safety bounds for LLM rule‑following

RoPE++ keeps imaginary attention to halve KV cache with 64k+ context

🏗️ AI infra cycle, memory bottlenecks and export risk

AI-driven semiconductor “giga cycle” pushes chips toward ~$1T and HBM toward $100B

US bill seeks 30‑month halt on Nvidia H200 export licences to China

Broadcom’s $11.1B quarter underscores how AI pays for the “plumbing”

Bond report projects “AI era” to reach tens of billions of edge devices

Starcloud and Google explore space-based data centers for AI compute

🗣️ Builders’ stacks, long‑context warnings and UX pain points

Builders lean into multi-model stacks with Opus 4.5, GPT‑5.2 and Gemini

Amp debates whether 1M‑token threads help more than they hurt

Claude’s chat compaction and file UX spark pushback from knowledge workers

GPT‑5.2 Pro’s Extended and xhigh modes trade latency for reliability

💼 Market maps and enterprise platform signals

Updated open‑model tier list crowns DeepSeek, Qwen, Kimi as frontier labs

Google’s Antigravity IDE quietly ships as a free, agent‑first editor

Microsoft Copilot teases 2025 Flight Log and Smart Plus GPT‑5.2 mode

Statista map shows Beijing and Silicon Valley dominate AI share of VC

US computing patents classified as G06 spike after ChatGPT era

🎬 Creative pipelines: Gemini Flash, NB Pro grids, and Retake

Gemini 3 Flash spins an entire animated “video” from a single HTML prompt

3×3 grids evolve into multi-model pipelines for AI “cinematography”

LTX Retake turns 20-second clips into new shots with one prompt

DesignArena’s new Lotus and Cactus image models appear to be OpenAI GPT‑4

Nano Banana Pro gets a "cloud pareidolia" recipe for pseudo‑photography

Nano Banana Pro is being used for full literary spreads like A Christmas Carol

Horror and sci‑fi merchants are prototyping detailed pillow designs with NB Pro

🤖 Embodied: open robots ship, lab rigs get game‑ready

Reachy Mini robots start landing on desks as open-source AI hardware

Doom-playing rat rig evolves into a richer embodied learning setup

Gemini Live drives Stanford’s Puppers robot dog in new demo

🎙️ Realtime voice: Translate and site voice agents

Gemini speech-to-speech in Google Translate headed to devs next year

Solo consultant shows ElevenLabs voice agent wired into RAG and n8n CRM

Builders report noticeable Gemini Flash native audio quality jump

Community project brings Gemini Native Audio translation into the web stack
