HiL-Bench
Benchmark measuring whether AI coding agents know when to ask for help.
A human-in-the-loop benchmark for coding agents that measures selective escalation, i.e., whether an agent asks for help exactly when it should, by seeding SWE (software-engineering) and text-to-SQL tasks with missing, ambiguous, or contradictory blockers.
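Selective escalation reduces to a binary classification problem: each task either contains a blocker (the agent should escalate) or does not (it should proceed). Below is a minimal sketch of that framing; the schema, field names, and metrics are illustrative assumptions, not HiL-Bench's actual format or API.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical sketch only: all names below are assumptions for
# illustration, not HiL-Bench's real data model.

class Blocker(Enum):
    NONE = "none"                    # fully specified; agent should proceed
    MISSING = "missing"              # required info (e.g., a schema) is absent
    AMBIGUOUS = "ambiguous"          # spec admits incompatible readings
    CONTRADICTORY = "contradictory"  # spec conflicts with itself or the data

@dataclass
class Task:
    task_id: str
    domain: str            # "swe" or "text2sql"
    blocker: Blocker       # ground-truth label
    agent_escalated: bool  # did the agent ask for help on this task?

def selective_escalation_scores(tasks: list[Task]) -> dict[str, float]:
    """Score escalation as classification: the agent should
    escalate exactly when a blocker is present."""
    should = [t.blocker is not Blocker.NONE for t in tasks]
    did = [t.agent_escalated for t in tasks]
    tp = sum(s and d for s, d in zip(should, did))       # asked when blocked
    fp = sum(not s and d for s, d in zip(should, did))   # asked needlessly
    fn = sum(s and not d for s, d in zip(should, did))   # proceeded while blocked
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}
```

Reporting precision alongside recall captures the "selective" part: an agent that escalates on every task scores perfect recall but poor precision, while one that never asks scores zero on both.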