Data Agent Benchmark launches with 54 queries and 38% pass@1
Data Agent Benchmark launches with 54 enterprise-style queries across 12 datasets, nine domains, and four database systems, while the best frontier model reaches only 38% pass@1. It gives teams a stronger eval for cross-database agents than text-to-SQL-only benchmarks.

TL;DR
- The new Data Agent Benchmark, or DAB, targets a gap in agent evals by testing how well models query and combine data across multiple enterprise databases: per the authors' launch thread, its 54 queries span 12 datasets, nine domains, and four DBMSs.
- The ceiling is still low: the launch thread reports that the best frontier model reached only 38% pass@1 across 50 trials, leaving plenty of headroom for agentic data work.
- The release explicitly positions DAB as going beyond "vanilla text2SQL/TableQA benchmarks," in the repost's words, shifting the focus from single-table SQL generation to multi-step agent workflows.
- That framing matches a broader practitioner complaint that benchmarks still test "single clean tables like it's 2019," as Gregor's repost puts it, which is exactly the setup DAB is trying to move past.
What does DAB actually test?
DAB is built around enterprise-style data tasks where an agent has to inspect databases, issue queries, run Python, and return an answer inside a ReAct-style loop. The authors' launch thread says the benchmark covers 54 queries across 12 datasets, nine domains, and four database management systems, grounded in a formative study of real enterprise data-agent workloads.
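To make the workflow concrete, here is a minimal sketch of what a ReAct-style data-agent loop can look like. The tool names, the `call_model` stub, and the JSON action format are illustrative assumptions for this sketch, not DAB's actual harness.

```python
# Sketch of a ReAct-style data-agent loop: the model alternates
# thought/action steps with tool observations until it answers.
# Tool set and action schema are hypothetical, not DAB's harness.
import json
import sqlite3
import subprocess

def list_tables(db_path: str) -> str:
    """Tool: enumerate tables in a SQLite database."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
    return json.dumps([r[0] for r in rows])

def run_sql(db_path: str, query: str) -> str:
    """Tool: run a SQL query and return the rows as JSON."""
    with sqlite3.connect(db_path) as conn:
        return json.dumps(conn.execute(query).fetchall())

def run_python(code: str) -> str:
    """Tool: execute a Python snippet and capture its output."""
    result = subprocess.run(["python", "-c", code],
                            capture_output=True, text=True, timeout=60)
    return result.stdout or result.stderr

TOOLS = {"list_tables": list_tables, "run_sql": run_sql, "run_python": run_python}

def react_loop(call_model, task: str, max_steps: int = 10) -> str:
    """Feed the growing transcript to the model; dispatch its actions.

    `call_model` is any callable returning an action dict such as
    {"tool": "run_sql", "args": {...}} or {"final": "<answer>"}.
    """
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        action = call_model("\n".join(transcript))
        if "final" in action:
            return action["final"]
        observation = TOOLS[action["tool"]](**action["args"])
        transcript.append(f"Action: {json.dumps(action)}")
        transcript.append(f"Observation: {observation}")
    return "max steps exceeded"
```

The key property this loop shares with the benchmark setup is that the model never answers in one shot: every query result comes back as an observation it must interpret before deciding the next step.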
The setup figure makes clear the benchmark is not just SQL synthesis. In the example trace, the agent lists tables in PostgreSQL and SQLite, queries both systems, then uses Python to reconcile mismatched keys before producing a final answer. That means the eval includes cross-database joins, unstructured-text extraction, and tool use, not just generating one correct query string. The project also ships with a paper, benchmark code, and a leaderboard.
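For a sense of what "reconcile mismatched keys" means in practice, here is an illustrative version of that cross-database step. All connection details, table and column names, and the normalization rule below are hypothetical placeholders, not taken from the benchmark.

```python
# Illustrative cross-database step: pull rows from PostgreSQL and
# SQLite, then reconcile mismatched join keys in Python. Every name
# and connection string here is a made-up placeholder.
import sqlite3

import pandas as pd
import psycopg2  # assumed Postgres driver; any DB-API client works

pg_conn = psycopg2.connect("dbname=sales user=analyst")   # hypothetical DSN
orders = pd.read_sql("SELECT customer_id, total FROM orders", pg_conn)

lite_conn = sqlite3.connect("crm.db")                      # hypothetical file
customers = pd.read_sql("SELECT cust_id, region FROM customers", lite_conn)

# The two systems disagree on key formatting (e.g. 'C-0042' vs 'c0042'),
# so normalize both sides before joining -- the kind of fix-up that a
# single SQL query can't express across two separate engines.
def normalize(key: str) -> str:
    return key.replace("-", "").lower()

orders["key"] = orders["customer_id"].map(normalize)
customers["key"] = customers["cust_id"].map(normalize)

answer = orders.merge(customers, on="key").groupby("region")["total"].sum()
print(answer)
```

The point is not the pandas specifics but the shape of the task: the agent has to notice the key mismatch from query results, then write glue code to bridge the engines, which is exactly what one-shot text-to-SQL benchmarks never exercise.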
How hard is it, and why does it matter?
The headline result is the difficulty: the launch thread reports that the best frontier model manages only 38% pass@1 over 50 trials. For engineering teams, that makes DAB more useful as a live eval than saturated text-to-SQL suites, where top models bunch near the ceiling.
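For readers less familiar with the metric: pass@1 estimated over repeated trials reduces to the empirical success rate, since the standard unbiased pass@k estimator equals c/n when k = 1. The numbers in the snippet below are made up for illustration, not DAB results.

```python
# pass@k estimator from Chen et al. (2021): 1 - C(n-c, k) / C(n, k)
# for n attempts with c successes. For k = 1 this is just c / n.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate over n trials with c successes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example only: a query attempted 50 times with 19 successes.
print(pass_at_k(n=50, c=19, k=1))   # 0.38 == 19/50
```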
The release thread frames DAB as "going beyond vanilla text2SQL/TableQA benchmarks," and the repost makes the same point directly. The reposted comment captures the critique of older evals: they keep testing SQL generation on "single clean tables like it's 2019." A separate reaction from Lech Mazur treats DataAgentBench as one of the still-unsaturated agent benchmarks, reinforcing the case that multi-database, tool-using data agents remain far from solved.