Reproducible GUI-agent failure analysis

GUI Agent Benchmark

A deterministic browser benchmark and evidence-chain workspace for studying where GUI agents fail at the primitive interaction level.

[Figure: summary dashboard showing task capture, success rate, average score, and failure-primitive distribution]
10/10 UI-TARS-style tasks captured
0/10 full end-to-end successes
0.206 average diagnostic score
40 judge criteria preserved
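
The headline figures above are aggregates over per-task results. As a rough illustration only, the sketch below shows one way such a summary could be computed. The `TaskResult` record, its field names, and the `summarize` helper are hypothetical and are not taken from this repository, and the demo values are placeholders rather than benchmark data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    task_id: str
    captured: bool       # was evidence captured for this task?
    success: bool        # did the agent complete the task end to end?
    score: float         # diagnostic score in [0, 1]
    criteria_count: int  # number of judge criteria attached to the task


def summarize(results: List[TaskResult]) -> Dict[str, object]:
    """Aggregate per-task records into dashboard-style headline figures."""
    total = len(results)
    captured = sum(r.captured for r in results)
    successes = sum(r.success for r in results)
    # Guard against an empty run so the average is always defined.
    average = sum(r.score for r in results) / total if total else 0.0
    return {
        "captured": f"{captured}/{total}",
        "successes": f"{successes}/{total}",
        "average_score": round(average, 3),
        "criteria": sum(r.criteria_count for r in results),
    }


# Placeholder records, NOT the benchmark's actual results.
demo = [
    TaskResult("task-a", captured=True, success=False, score=0.25, criteria_count=4),
    TaskResult("task-b", captured=True, success=False, score=0.5, criteria_count=4),
]
summary = summarize(demo)
print(summary)
```

Reporting capture coverage and end-to-end success as separate counts keeps the distinction visible: a run can be fully captured and still contain no successful tasks.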

Evidence boundary: this release is a diagnostic failure-analysis harness, not a leaderboard claim. The expanded round has complete capture coverage, but the historical step traces are still derived timeline attributions, not raw UI-TARS action transcripts.