TryCua launches Cua-Bench for KiCad; GPT-5.5 clears 6 of 25 tasks
TryCua and Snorkel opened Cua-Bench, a computer-use benchmark with 25 expert-authored KiCad tasks graded by exact netlist matches. The early results show frontier models still struggle with GUI execution, wiring completion, and self-checking, so treat benchmark wins as incomplete for real computer-use work.

TL;DR
- Cua-Bench's launch thread says the new benchmark covers 25 expert-authored KiCad tasks, and the best frontier model in the opening run cleared only 6.
- According to the methodology thread, each task was written by a practicing electrical engineer, reviewed by a second engineer, and graded by exact netlist match, with no LLM judge in the loop.
- The results post puts GPT-5.5 at 6 of 25 full passes, while Claude Sonnet 4.5 and Haiku 4.5 each reached 5.
- The task breakdown and the execution analysis both point to the same cliff: models can edit existing schematics, but from-scratch wiring runs usually stall before completion.
- The failure examples show the ugly part of computer-use evals: some agents declared success after leaving a resistor unconnected or formatting a value wrong inside KiCad.
You can jump straight to the public leaderboard, skim Snorkel's deeper analysis, and watch the launch demo if you want to see the benchmark in motion. The interesting bit is not that KiCad is hard. It is that Cua-Bench isolates the hard part cleanly: the models already know enough electronics to reason about op-amps and resistor values, but the execution thread says they still lose the plot when they have to place, wire, verify, and save inside a real desktop tool.
KiCad is the point
Most computer-use benchmarks live in browsers or generic system apps. Cua's setup thread argues that this lets agents survive by brute-force clicking and trial-and-error recovery.
KiCad is a nastier target. It combines dense shortcuts, multiple linked editors, and lots of visually similar parts, so the agent has to track the circuit and the GUI state at the same time.
That design choice makes the opening scores more revealing than a raw 6-of-25 headline. A model that can stumble through a website can still break badly when every click changes a structured artifact.
Exact netlists, no LLM judge
Cua-Bench's grading is unusually crisp for an agent benchmark. The grading thread says each task came from a practicing electrical engineer, got a second-engineer review, and passed only if the produced netlist matched ground truth exactly.
That removes a common source of benchmark mush. There is no model grading another model's output, and no partial credit for a schematic that looks plausible but wires the wrong pins.
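To make the pass/fail rule concrete, here is a minimal sketch of exact netlist matching. The benchmark's actual grader is not described in the thread, so everything here is an assumption for illustration: the dictionary layout, the helper names normalize and exact_match, and the choice to compare connectivity only while ignoring net names and pin order.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pin = Tuple[str, str]            # (reference designator, pin number), e.g. ("R1", "2")
Netlist = Set[FrozenSet[Pin]]    # a set of nets, each net a set of connected pins


def normalize(nets: Dict[str, List[Pin]]) -> Netlist:
    """Drop net names and ordering so only connectivity is compared (assumed rule)."""
    return {frozenset(pins) for pins in nets.values() if pins}


def exact_match(produced: Dict[str, List[Pin]], truth: Dict[str, List[Pin]]) -> bool:
    """Pass only if both netlists describe identical connectivity; no partial credit."""
    return normalize(produced) == normalize(truth)


# Hypothetical ground truth: a two-resistor divider fed from connector J1.
truth = {
    "VIN": [("R1", "1"), ("J1", "1")],
    "OUT": [("R1", "2"), ("R2", "1")],
    "GND": [("R2", "2"), ("J1", "2")],
}
# Same circuit with different net names and pin order: still a pass.
renamed = {
    "Net-1": [("J1", "1"), ("R1", "1")],
    "Net-2": [("R2", "1"), ("R1", "2")],
    "Net-3": [("J1", "2"), ("R2", "2")],
}
# R2 pin 1 never got wired to OUT: plausible on screen, but a fail.
dangling = {
    "VIN": [("R1", "1"), ("J1", "1")],
    "OUT": [("R1", "2")],
    "GND": [("R2", "2"), ("J1", "2")],
}

print(exact_match(renamed, truth))   # True
print(exact_match(dangling, truth))  # False
```

Comparing sets of pin sets means one missing wire flips the whole task to a fail, which is the all-or-nothing behavior the grading thread describes.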
The benchmark is also narrow on purpose. The launch thread frames it as professional software evaluation, and the first slice is just 25 KiCad tasks rather than a mixed bag of office apps and websites.
Execution, not electronics
Cua's execution thread says the models already know the electronics. They can identify parts, size resistors, and reason about bias points.
The miss is operational. Every full pass in the launch cohort was an edit to an existing schematic, while the task-results thread says the from-scratch tasks ended with unfinished wiring.
That split is a useful mental model for the whole benchmark:
- Existing schematic edit: models can sometimes localize the right component and make one precise change.
- Blank-canvas build: they have to place parts, connect nets, check the result, and keep context across a longer run.
- End-to-end verification: even after doing the work, they still need to inspect the actual screen state before declaring completion.
Planning, perception, navigation
Snorkel's linked analysis adds the error map the tweet thread only hints at. According to Snorkel's write-up, the opening cohort achieved 4 of 25 full passes in the deeper analysis run, with 16 tasks hitting the step cap and 5 ending in a false declaration of success.
The same Snorkel analysis groups most error mentions into three buckets:
- Planning and policy errors, about 40%
- Perception issues, about 22%
- Navigation inefficiencies, about 19%
Tooling itself barely shows up in the blame chart. Snorkel's breakdown says tool or API surface issues were about 8%, with zero API errors, which shifts the story away from harness bugs and toward agent loop quality.
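The figures quoted from Snorkel's analysis can be cross-checked with a little arithmetic. The sketch below uses only the numbers cited above; the dictionary keys and variable names are illustrative, and the bucket percentages are the article's rounded "about" values, so the remainder is approximate.

```python
# Run outcomes reported for the deeper analysis run.
outcomes = {"full pass": 4, "hit step cap": 16, "false success": 5}
total_tasks = sum(outcomes.values())
outcome_share = {name: 100 * count / total_tasks for name, count in outcomes.items()}

# Approximate share of error mentions per bucket, as quoted above.
error_buckets = {
    "planning and policy": 40,
    "perception": 22,
    "navigation": 19,
    "tool or API surface": 8,
}
named_share = sum(error_buckets.values())
unlisted_share = 100 - named_share  # mentions not attributed to a bucket here

print(total_tasks)                    # 25
print(outcome_share["hit step cap"])  # 64.0
print(named_share, unlisted_share)    # 89 11
```

The three outcomes account for all 25 tasks, with roughly two thirds of runs ending at the step cap. The four named buckets cover about 89% of error mentions, so roughly 11% falls outside the categories listed here.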
It also reports a lot of overhead before useful work starts. Onboarding dialogs, first-run prompts, update checks, and getting routed into the wrong editor consumed a surprising amount of the run budget.
Self-checking is still brittle
The most concrete failure mode in the evidence is self-verification. Cua's examples thread says some agents graded their own work by rereading what they had typed instead of checking the resulting schematic on screen.
The posted examples are small, but brutal:
- One agent left a resistor dangling and still called it connected.
- Another wrote 2.80kOhm where KiCad expected 2.8k.
Those are not domain-knowledge misses. They are GUI grounding misses, plus weak final-check behavior, which is exactly the kind of bug a benchmark on professional software should smoke out.
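Both examples are the kind of thing a final check against the saved artifact would catch. The sketch below is hypothetical: the thread does not describe any agent's verification step, and the function names, the net dictionary layout, and the exact-string comparison on the value field are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Pin = Tuple[str, str]  # (reference designator, pin number)


def dangling_pins(nets: Dict[str, List[Pin]], component_pins: List[Pin]) -> List[Pin]:
    """Return the component's pins that do not share a net with any other pin."""
    connected = set()
    for pins in nets.values():
        if len(pins) >= 2:       # a net with a single pin connects nothing
            connected.update(pins)
    return sorted(p for p in component_pins if p not in connected)


def value_matches(saved_value: str, expected_value: str) -> bool:
    """Compare the value read back from the schematic, not the text the agent typed."""
    return saved_value == expected_value


# Hypothetical state read back from the saved schematic.
nets = {
    "VIN": [("R1", "1"), ("J1", "1")],
    "OUT": [("R1", "2")],        # R1 pin 2 sits alone on its net
}
r1_pins = [("R1", "1"), ("R1", "2")]

print(dangling_pins(nets, r1_pins))       # [('R1', '2')]
print(value_matches("2.80kOhm", "2.8k"))  # False
```

The design point is the input to the check: it reads connectivity and field values from the resulting schematic, so an agent that only rereads its own keystrokes has nothing to feed it.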
Leaderboard is open
Cua launched the benchmark as a live public leaderboard, not a one-off paper result. The leaderboard invite says model providers can submit runs if they think they can beat 6 of 25.
That matters because the benchmark's first useful job is comparative, not absolute. The opening numbers are low, but the setup creates a clean place to measure whether future computer-use agents get better at multi-step desktop work, or just better at looking busy while the step budget burns down.