Data Agent Benchmark launches with 54 enterprise-style queries across 12 datasets, nine domains, and four database systems, while the best frontier model reaches only 38% pass@1. It gives teams a stronger eval for cross-database agents than text-to-SQL-only benchmarks.

DAB is built around enterprise-style data tasks in which an agent must inspect databases, issue queries, run Python, and return an answer inside a ReAct-style loop. The authors' launch thread says the benchmark covers 54 queries across 12 datasets, nine domains, and four database management systems, grounded in a formative study of real enterprise data-agent workloads.
The attached [img:0|DAB setup] shows that the benchmark is not just SQL synthesis. In the example trace, the agent lists tables in PostgreSQL and SQLite, queries both systems, then uses Python to reconcile mismatched keys before producing a final answer. That means the eval covers cross-database joins, unstructured-text extraction, and tool use, not just generating one correct query string. The project also ships with a paper, benchmark code, and a leaderboard.
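To make the reconciliation step concrete, here is a minimal sketch of the kind of work the trace describes. Two in-memory SQLite databases stand in for the two backends, and all table, column, and key names are invented for illustration; the point is that the join key is encoded differently in each system, so the agent has to normalize it in Python rather than in a single SQL query:

```python
import sqlite3

# Two in-memory SQLite databases stand in for the two backends
# (PostgreSQL and SQLite in the trace); schemas and data are hypothetical.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (customer_key TEXT, amount REAL)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)",
                      [("CUST-001", 120.0), ("CUST-002", 75.5)])

crm_db = sqlite3.connect(":memory:")
crm_db.execute("CREATE TABLE customers (id TEXT, name TEXT)")
crm_db.executemany("INSERT INTO customers VALUES (?, ?)",
                   [("cust_001", "Acme"), ("cust_002", "Globex")])

def canon(key: str) -> str:
    """Normalize mismatched key encodings, e.g. 'CUST-001' vs 'cust_001'."""
    return key.lower().replace("-", "_")

# Pull rows from each system, then join in Python on the canonical key.
names = {canon(i): n for i, n in crm_db.execute("SELECT id, name FROM customers")}
totals = {}
for key, amount in orders_db.execute("SELECT customer_key, amount FROM orders"):
    name = names[canon(key)]
    totals[name] = totals.get(name, 0.0) + amount

print(totals)  # {'Acme': 120.0, 'Globex': 75.5}
```

No single SQL dialect can express this join directly, since the keys only match after the Python-side normalization; that is exactly the cross-system glue work the benchmark stresses.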
The headline result is the difficulty curve: the launch thread reports that the best frontier model manages only 38% pass@1 over 50 trials. For engineering teams, that makes DAB more useful as a live eval than saturated text-to-SQL suites where top models bunch near the ceiling.
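For readers who want to reproduce the metric, the standard way to compute pass@k from repeated trials is the unbiased estimator popularized by the HumanEval benchmark; for k=1 it reduces to the success fraction. The 19-of-50 split below is a hypothetical illustration consistent with the reported 38%, not a figure from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from n trials with c successes
    (the HumanEval formulation); for k=1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: 19 successes out of 50 trials gives the reported 38% pass@1.
print(round(pass_at_k(50, 19, 1), 2))  # 0.38
```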
The release thread frames DAB as "going beyond vanilla text2SQL/TableQA benchmarks," and the repost makes the same point directly. One supporting reaction captures the critique of older evals: they keep testing SQL generation on "single clean tables like it's 2019." Another, from Lech Mazur, counts DataAgentBench among the still-unsaturated agent benchmarks, reinforcing the case that multi-database, tool-using data agents remain far from solved.
Databases are arguably the most commonly used enterprise tool, and enterprises typically have many of them. Yet no popular AI agent benchmark actually tests how well agents can query, join, and make sense of data across different databases! So, we built DAB (Data Agent …
Excited to release the Data Agent Benchmark, going beyond vanilla text2SQL/TableQA benchmarks to stress-test how models work with (and join data across) multiple database backends employing different schemas and encodings. Turns out no models do well! Hoping this will spur …
This was a fun collaboration between @UCBEPIC, @Berkeley_EECS, and @PromptQL. Read our paper here: arxiv.org/abs/2603.20576. Check out the benchmark here: github.com/ucbepic/DataAg… Leaderboard here: ucbepic.github.io/DataAgentBench/