Release · March 30, 2026

Tau Bench adds Banking domain with 698 documents and sub-10% pass^4

Tau Bench added a banking customer-support domain with 698 docs across 21 product categories and multi-step tool use. The best model managed only 25% task success and under 10% pass^4, showing how far agents are from reliable support automation.

Evals, LLM as Judge, Benchmarks

TL;DR

Tau Bench added a new Banking domain that turns the benchmark toward customer-support style agent work, with a fintech knowledge base spanning 698 documents across 21 product categories, according to _philschmid's Tau Bench update.
The new tasks are not just retrieval tests: agents have to search the corpus, reason over policy details, and complete multi-step tool actions in one flow, as described in the launch thread.
Early results are weak. _philschmid's leaderboard summary says the best model reaches about 25% task success, while the linked paper reports roughly 25.5% success with pass^4 dropping below 10%.
That makes the update useful for engineers building support agents: the paper abstract frames the gap as reliable use of unstructured knowledge plus tool execution, not just one-shot reasoning.

What does the new Banking benchmark actually add?

Philipp Schmid

@_philschmid

Tau Bench got an update! Tau Bench is one of the most adopted Agentic Benchmarks. They now added “Banking” a fintech-inspired customer support domain built around a realistic knowledge base of 698 documents across 21 product categories. Tasks require agents to search this …

4:32 PM · Mar 30, 2026

The new Banking domain pushes Tau Bench closer to a real support workflow. Instead of a narrow API-only task set, it is built around a “realistic knowledge base” of 698 documents across 21 product categories, and the benchmark asks agents to combine document search, policy reasoning, and tool use in the same interaction. One example request in _philschmid's thread combines two intents in one turn: “dispute” a transaction and “file a credit limit increase request.”
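To make the shape of such a task concrete, here is a toy sketch of a two-intent request handled as search, then policy check, then tool call. Everything in it is hypothetical and not taken from Tau Bench: the three-document knowledge base, the 60-day and 6-month policy thresholds, the keyword-overlap search, and the tool names `file_dispute` and `request_limit_increase` are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text: str


# Hypothetical miniature knowledge base; the real one has 698 documents.
KB = [
    Doc("kb-001", "Card transaction disputes must be filed within 60 days of the posting date."),
    Doc("kb-002", "Credit limit increase requests require an account open for at least 6 months."),
    Doc("kb-003", "Savings accounts earn interest monthly."),
]


def search(query: str, kb: list[Doc], top_n: int = 1) -> list[Doc]:
    """Naive retrieval: rank documents by count of shared lowercase words."""
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(doc.text.lower().replace(".", "").split())), doc)
        for doc in kb
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_n] if score > 0]


@dataclass
class Customer:
    account_age_months: int
    days_since_transaction: int


# Hypothetical tools: each one only records the action it would take.
def file_dispute(log: list, txn_id: str) -> None:
    log.append(("file_dispute", txn_id))


def request_limit_increase(log: list, account_id: str) -> None:
    log.append(("request_limit_increase", account_id))


def handle_two_intent_request(customer: Customer, txn_id: str, account_id: str) -> list:
    """Resolve both intents in one flow: retrieve policy, check it, then act."""
    log: list = []
    # Thresholds are hard-coded to mirror the toy documents above; a real agent
    # would have to read them out of whatever document it retrieved.
    if search("dispute a card transaction", KB) and customer.days_since_transaction <= 60:
        file_dispute(log, txn_id)
    if search("credit limit increase request", KB) and customer.account_age_months >= 6:
        request_limit_increase(log, account_id)
    return log


if __name__ == "__main__":
    print(handle_two_intent_request(Customer(12, 10), "txn-1", "acct-1"))
```

In this sketch a customer whose account is only three months old would get the dispute filed but not the limit increase, which is the kind of partially satisfiable request where an agent has to apply each policy separately rather than treat the turn as a single yes or no.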

Philipp Schmid

@_philschmid

Replying to @_philschmid

Leaderboard: taubench.com/#leaderboard?b… Paper: arxiv.org/abs/2603.04370

4:32 PM · Mar 30, 2026

The linked tau-Knowledge paper describes this as a benchmark for “conversational agents over unstructured knowledge,” which is the important shift for implementation teams. The failure mode is not simply whether a model can call a tool; it is whether it can retrieve the right policy from a densely linked corpus and then produce a verifiable, policy-compliant action. The leaderboard and paper linked in the thread show how hard that remains: top models land around 25% task success, and reliability degrades sharply when the same task has to succeed on every one of several repeated trials, with pass^4 under 10%.
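For readers unfamiliar with the metric, pass^k in the original Tau Bench work is the chance that all k independent trials of the same task succeed, averaged over tasks. The sketch below assumes the Banking domain is scored the same way and uses the usual unbiased estimator, C(c, k) / C(n, k), for a task with c successes in n trials. The function names and the toy results are made up for illustration and are not leaderboard data.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn from `trials` all succeeded."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # math.comb returns 0 when successes < k, so such tasks score 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: list[list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks."""
    per_task = [pass_hat_k(sum(trials), len(trials), k) for trials in results]
    return sum(per_task) / len(per_task)


# Made-up outcomes: 4 tasks, 4 trials each (True = task solved).
toy_results = [
    [True, True, True, True],
    [True, False, True, False],
    [False, False, False, False],
    [True, True, True, False],
]

if __name__ == "__main__":
    for k in (1, 4):
        print(f"pass^{k} = {benchmark_pass_hat_k(toy_results, k):.4f}")
```

On this toy data pass^1 is 0.5625 while pass^4 is 0.25: only the task solved on all four trials counts toward pass^4, so a task solved three times out of four contributes nothing.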