Data Agent Benchmark launches with 54 enterprise-style queries across 12 datasets, nine domains, and four database systems, while the best frontier model reaches only 38% pass@1. It gives teams a stronger eval for cross-database agents than text-to-SQL-only benchmarks.

DAB is built around enterprise-style data tasks in which an agent must inspect databases, issue queries, run Python, and return an answer inside a ReAct-style loop. The authors' launch thread says the benchmark covers 54 queries across 12 datasets, nine domains, and four database management systems, grounded in a formative study of real enterprise data-agent workloads.
The attached [img:0|DAB setup] shows that the benchmark is not just SQL synthesis. In the example trace, the agent lists tables in PostgreSQL and SQLite, queries both systems, then uses Python to reconcile mismatched keys before producing a final answer. That means the eval covers cross-database joins, unstructured-text extraction, and tool use, not just generating one correct query string. The project also ships with a paper, benchmark code, and a leaderboard.
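To make the reconciliation step concrete, here is a minimal sketch of the kind of work the trace describes. Two in-memory SQLite databases stand in for the two backends, and all table, column, and key names are invented for illustration; the point is that the join key is encoded differently in each system, so the agent has to normalize it in Python rather than in a single SQL query:

```python
import sqlite3

# Two in-memory SQLite databases stand in for the two backends
# (PostgreSQL and SQLite in the trace); schemas and data are hypothetical.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (customer_key TEXT, amount REAL)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)",
                      [("CUST-001", 120.0), ("CUST-002", 75.5)])

crm_db = sqlite3.connect(":memory:")
crm_db.execute("CREATE TABLE customers (id TEXT, name TEXT)")
crm_db.executemany("INSERT INTO customers VALUES (?, ?)",
                   [("cust_001", "Acme"), ("cust_002", "Globex")])

def canon(key: str) -> str:
    """Normalize mismatched key encodings, e.g. 'CUST-001' vs 'cust_001'."""
    return key.lower().replace("-", "_")

# Pull rows from each system, then join in Python on the canonical key.
names = {canon(i): n for i, n in crm_db.execute("SELECT id, name FROM customers")}
totals = {}
for key, amount in orders_db.execute("SELECT customer_key, amount FROM orders"):
    name = names[canon(key)]
    totals[name] = totals.get(name, 0.0) + amount

print(totals)  # {'Acme': 120.0, 'Globex': 75.5}
```

No single SQL dialect can express this join directly, since the keys only match after the Python-side normalization; that is exactly the cross-system glue work the benchmark stresses.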
The headline result is the difficulty curve: the launch thread reports that the best frontier model manages only 38% pass@1 over 50 trials. For engineering teams, that makes DAB more useful as a live eval than saturated text-to-SQL suites where top models bunch near the ceiling.
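For readers who want to reproduce the metric, the standard way to compute pass@k from repeated trials is the unbiased estimator popularized by the HumanEval benchmark; for k=1 it reduces to the success fraction. The 19-of-50 split below is a hypothetical illustration consistent with the reported 38%, not a figure from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from n trials with c successes
    (the HumanEval formulation); for k=1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: 19 successes out of 50 trials gives the reported 38% pass@1.
print(round(pass_at_k(50, 19, 1), 2))  # 0.38
```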
The release thread frames DAB as "going beyond vanilla text2SQL/TableQA benchmarks," and the repost makes the same point directly. One supporting reaction captures the critique of older evals: they keep testing SQL generation on "single clean tables like it's 2019." Another, from Lech Mazur, counts DataAgentBench among the still-unsaturated agent benchmarks, reinforcing the case that multi-database, tool-using data agents remain far from solved.
Databases are arguably the most commonly used enterprise tool, and enterprises typically have many of them. Yet no popular AI agent benchmark actually tests how well agents can query, join, and make sense of data across different databases! So, we built DAB (Data Agent …
Excited to release the Data Agent Benchmark, going beyond vanilla text2SQL/TableQA benchmarks to stress-test how models work with (and join data across) multiple database backends employing different schemas and encodings. Turns out no models do well! Hoping this will spur …
This was a fun collaboration between @UCBEPIC, @Berkeley_EECS, and @PromptQL. Read our paper here: arxiv.org/abs/2603.20576. Check out the benchmark here: github.com/ucbepic/DataAg… Leaderboard here: ucbepic.github.io/DataAgentBench/