HiL-Bench
Benchmark measuring whether AI coding agents know when to ask for help.
A human-in-the-loop benchmark for coding agents that measures selective escalation, i.e., whether an agent asks for help exactly when it should, by seeding SWE (software-engineering) and text-to-SQL tasks with missing, ambiguous, or contradictory blockers.
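Selective escalation reduces to a binary classification problem: each task either contains a blocker (the agent should escalate) or does not (it should proceed). Below is a minimal sketch of that framing; the schema, field names, and metrics are illustrative assumptions, not HiL-Bench's actual format or API.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical sketch only: all names below are assumptions for
# illustration, not HiL-Bench's real data model.

class Blocker(Enum):
    NONE = "none"                    # fully specified; agent should proceed
    MISSING = "missing"              # required info (e.g., a schema) is absent
    AMBIGUOUS = "ambiguous"          # spec admits incompatible readings
    CONTRADICTORY = "contradictory"  # spec conflicts with itself or the data

@dataclass
class Task:
    task_id: str
    domain: str            # "swe" or "text2sql"
    blocker: Blocker       # ground-truth label
    agent_escalated: bool  # did the agent ask for help on this task?

def selective_escalation_scores(tasks: list[Task]) -> dict[str, float]:
    """Score escalation as classification: the agent should
    escalate exactly when a blocker is present."""
    should = [t.blocker is not Blocker.NONE for t in tasks]
    did = [t.agent_escalated for t in tasks]
    tp = sum(s and d for s, d in zip(should, did))       # asked when blocked
    fp = sum(not s and d for s, d in zip(should, did))   # asked needlessly
    fn = sum(s and not d for s, d in zip(should, did))   # proceeded while blocked
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}
```

Reporting precision alongside recall captures the "selective" part: an agent that escalates on every task scores perfect recall but poor precision, while one that never asks scores zero on both.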