Data Agent Benchmark launches with 54 enterprise-style queries across 12 datasets, nine domains, and four database systems, while the best frontier model reaches only 38% pass@1. It gives teams a stronger eval for cross-database agents than text-to-SQL-only benchmarks.

DAB is built around enterprise-style data tasks where an agent has to inspect databases, issue queries, run Python, and return an answer inside a ReAct-style loop. The authors' launch thread says the benchmark covers 54 queries across 12 datasets, nine domains, and four database management systems, grounded in a formative study of real enterprise data-agent workloads.
The attached image [img:0|DAB setup] shows that the benchmark is not just SQL synthesis. In the example trace, the agent lists tables in PostgreSQL and SQLite, queries both systems, then uses Python to reconcile mismatched keys before producing a final answer. That means the eval includes cross-database joins, unstructured-text extraction, and tool use, not just generating one correct query string. The project ships with a paper, benchmark code, and a leaderboard.
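The reconciliation step in that trace can be sketched in a few lines. This is a hypothetical illustration, not DAB code: two in-memory SQLite databases stand in for the PostgreSQL and SQLite backends, and the table names, column names, and key formats are invented for the example.

```python
import sqlite3

# Stand-in for the PostgreSQL side: orders keyed by a prefixed ID.
pg = sqlite3.connect(":memory:")
pg.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
pg.executemany("INSERT INTO orders VALUES (?, ?)",
               [("CUST-0001", 120.0), ("CUST-0002", 75.5)])

# The SQLite side: customers keyed by a bare numeric ID.
lite = sqlite3.connect(":memory:")
lite.execute("CREATE TABLE customers (id TEXT, name TEXT)")
lite.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("1", "Acme"), ("2", "Globex")])

def normalize(key: str) -> str:
    # Reconcile "CUST-0001" with "1" by stripping the prefix and zero padding.
    return key.removeprefix("CUST-").lstrip("0")

orders = {normalize(cid): amt for cid, amt in pg.execute("SELECT * FROM orders")}
names = dict(lite.execute("SELECT id, name FROM customers"))

# The join happens in Python, since no single SQL engine sees both tables.
report = {names[k]: orders[k] for k in orders if k in names}
print(report)  # {'Acme': 120.0, 'Globex': 75.5}
```

The point of the benchmark is that this glue step (normalize keys, join across engines in application code) is exactly what text-to-SQL suites never test.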
The headline result is the difficulty curve: the launch thread reports that the best frontier model manages only 38% pass@1 over 50 trials. For engineering teams, that makes DAB more useful as a live eval than saturated text-to-SQL suites where top models bunch near the ceiling.
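The thread does not spell out how "38% pass@1 over 50 trials" is computed, but a reasonable reading is the standard unbiased pass@k estimator from the HumanEval paper, with k=1 reducing to the success fraction:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k: probability that at least one of k samples,
    # drawn without replacement from n trials with c successes, passes.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 50 trials, a reported 38% pass@1 corresponds to 19 passing runs.
print(round(pass_at_k(50, 19, 1), 2))  # 0.38
```

For k=1 this is just c/n, so running 50 trials mainly tightens the estimate rather than changing its meaning.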
The release thread frames DAB as "going beyond vanilla text2SQL/TableQA benchmarks," and the repost makes the same point directly. One supporting reaction captures the critique of older evals: they keep testing SQL generation on "single clean tables like it's 2019." A contextual reaction from Lech Mazur likewise treats DataAgentBench as one of the still-unsaturated agent benchmarks, reinforcing the case that multi-database, tool-using data agents remain far from solved.
Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
breaking: Malicious LiteLLM 1.82.7 and 1.82.8 releases executed .pth startup code to steal credentials and were quarantined after disclosure. Rotate secrets, audit transitive AI-tooling dependencies, and add package-age controls before letting agents install packages autonomously.
breaking: TurboQuant claims 6x KV-cache memory reduction and up to 8x faster attention on H100s without retraining or quality loss on long-context tasks. If those results hold in serving stacks, teams should revisit long-context cost, capacity, and vector-search design.
release: OpenCode is adding remote sandboxes, synced state across laptop, server, and cloud, and more product surface inside its plugin system. That makes long-running off-laptop workflows more practical, but operators should still review telemetry, sandbox, and exposure defaults.
release: Claude Code 2.1.84 adds an opt-in PowerShell tool, new task and worktree hooks, safer MCP limits, and better startup and prompt-cache behavior. Anthropic also documented auto mode's action classifier and added iMessage as a channel, so teams should review permissions and remote-control workflows.
Databases are arguably the most commonly used enterprise tool, and enterprises typically have many of them. Yet no popular AI agent benchmark actually tests how well agents can query, join, and make sense of data across different databases! So, we built DAB (Data Agent Benchmark) …
Excited to release the Data Agent Benchmark, going beyond vanilla text2SQL/TableQA benchmarks to stress-test how models work with (and join data across) multiple database backends employing different schemas and encodings. Turns out no models do well! Hoping this will spur …
This was a fun collaboration between @UCBEPIC, @Berkeley_EECS, and @PromptQL Read our paper here: arxiv.org/abs/2603.20576 Check out the benchmark here: github.com/ucbepic/DataAg… Leaderboard here: ucbepic.github.io/DataAgentBench/