GUI Agent Benchmark

Failure-analysis benchmark for browser GUI agents

A deterministic local benchmark that keeps the evidence chain intact: task definitions, UI-TARS attempts, capture bundles, judge results, step traces, failure taxonomy, and finish-gate validation.

Not a leaderboard | 10 browser tasks | 40 judge criteria | Qualitative analysis

Captured attempts: 10/10 (expanded UI-TARS-style round)
Full successes: 0/10 (useful as diagnostic evidence)
Average score: 0.206 (mean score across captured tasks)
Oracle solvable: 10/10 (scripted UI baseline passed)

Expanded Round Task Scores

2026-05-24 evidence
settings-toggle             0.75
validation-error-recovery   0.40
onboarding-form             0.33
pagination-review           0.33
file-upload-request         0.25
catalog-filter              0.00
ticket-review               0.00
modal-confirmation          0.00
sortable-inventory          0.00
multi-select-approvals      0.00
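
As a quick cross-check, the headline average can be recomputed from the per-task scores listed above. This is a minimal sketch: the `TASK_SCORES` dictionary is transcribed by hand from this page rather than loaded from the summary JSON, and treating a score of 1.0 as a "full success" is an assumption, not a documented threshold.

```python
from statistics import mean

# Per-task scores transcribed from the list above (2026-05-24 evidence).
TASK_SCORES = {
    "settings-toggle": 0.75,
    "validation-error-recovery": 0.40,
    "onboarding-form": 0.33,
    "pagination-review": 0.33,
    "file-upload-request": 0.25,
    "catalog-filter": 0.00,
    "ticket-review": 0.00,
    "modal-confirmation": 0.00,
    "sortable-inventory": 0.00,
    "multi-select-approvals": 0.00,
}

# Mean over all captured tasks, rounded to the precision shown on the page.
average_score = round(mean(TASK_SCORES.values()), 3)

# Assumption: a "full success" is a task whose score reaches 1.0.
full_successes = sum(1 for score in TASK_SCORES.values() if score >= 1.0)

print(f"average score: {average_score}")
print(f"full successes: {full_successes}/{len(TASK_SCORES)}")
```

With the scores as listed, this reproduces the 0.206 average and 0/10 full successes reported in the summary.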

Why This Is Resume-Relevant

The project is strongest as an agent reliability artifact: it shows how to turn ambiguous GUI-agent failures into a reviewable evidence chain with validators and honest scope boundaries.

Deterministic Benchmark

Local browser tasks and judge criteria avoid subjective "looks successful" evaluations.

Evidence Chain

Each result links back to capture bundles, summaries, step traces, taxonomy, and finish gates.

Bounded Claims

The README and reports explicitly separate qualitative failure analysis from model ranking claims.

Task Matrix

Scores come from experiments/2026-05-24-uitars-expanded-real-round/real-run-summary.json.

Task                        Score   Main failed primitive
onboarding-form             0.33    Text-entry continuation and submit
catalog-filter              0.00    Filter/search to selected item commit
settings-toggle             0.75    Dropdown value commit
ticket-review               0.00    Table search, selection, and review commit
modal-confirmation          0.00    Modal open and confirm sequence
pagination-review           0.33    Page navigation and row action
sortable-inventory          0.00    Sort commit and row selection
multi-select-approvals      0.00    Multi-select and submit
validation-error-recovery   0.40    Validation recovery after error
file-upload-request         0.25    Upload form dropdown and submit

Failure Primitive Distribution

Primary failure codes are mapped in the expanded failure taxonomy artifact.

ACT-DROPDOWN-VALUE-MISS            2 tasks
ACT-TEXT-ENTRY-STALL               1 task
ACT-SELECTION-COMMIT-MISS          1 task
ACT-TABLE-SEARCH-LOOP              1 task
ACT-MODAL-CONFIRMATION-MISS        1 task
ACT-PAGINATION-NAV-MISS            1 task
ACT-SORT-COMMIT-MISS               1 task
ACT-MULTI-SELECT-MISS              1 task
ACT-VALIDATION-RECOVERY-PARTIAL    1 task
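
The distribution above can be sanity-checked against the task count. The sketch below assumes each of the 10 tasks carries exactly one primary failure code, which the counts suggest but this page does not state outright. The counts are copied by hand from the list above, and `distribution_is_consistent` is a hypothetical helper, not part of the benchmark.

```python
# Primary failure code counts transcribed from the distribution above.
FAILURE_CODE_COUNTS = {
    "ACT-DROPDOWN-VALUE-MISS": 2,
    "ACT-TEXT-ENTRY-STALL": 1,
    "ACT-SELECTION-COMMIT-MISS": 1,
    "ACT-TABLE-SEARCH-LOOP": 1,
    "ACT-MODAL-CONFIRMATION-MISS": 1,
    "ACT-PAGINATION-NAV-MISS": 1,
    "ACT-SORT-COMMIT-MISS": 1,
    "ACT-MULTI-SELECT-MISS": 1,
    "ACT-VALIDATION-RECOVERY-PARTIAL": 1,
}

TOTAL_TASKS = 10  # from the page header: 10 browser tasks


def distribution_is_consistent(counts, total_tasks):
    """Hypothetical check: one primary code per task means counts sum to the task count."""
    return sum(counts.values()) == total_tasks


# Share of tasks attributed to each primary failure code.
shares = {code: n / TOTAL_TASKS for code, n in FAILURE_CODE_COUNTS.items()}

print(distribution_is_consistent(FAILURE_CODE_COUNTS, TOTAL_TASKS))
print(max(shares, key=shares.get))
```

Under that assumption the nine codes account for all 10 tasks, with the dropdown value miss being the only code that appears more than once.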

Evidence Chain

A task is not treated as closed just because a model run happened; the artifacts need to agree.

Deterministic task
UI-TARS attempt
Preflight target check
Capture bundle
Real-run summary
Step trace + taxonomy
Finish gate
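
The idea that a task only closes when its artifacts agree can be sketched as a small gate function. Everything below is illustrative: the stage keys mirror the chain above, but the `task_id` field, the `finish_gate` function, and the in-memory dictionary layout are assumptions and do not describe the project's actual validator or file formats.

```python
# Stage names mirror the chain above; the record layout is hypothetical.
REQUIRED_STAGES = (
    "task_definition",
    "uitars_attempt",
    "preflight_target_check",
    "capture_bundle",
    "real_run_summary",
    "step_trace_and_taxonomy",
)


def finish_gate(task_id, artifacts):
    """Return a list of problems; an empty list means the gate passes."""
    problems = []
    for stage in REQUIRED_STAGES:
        record = artifacts.get(stage)
        if record is None:
            # A model run without its supporting artifact does not close the task.
            problems.append(f"missing {stage}")
        elif record.get("task_id") != task_id:
            # Every artifact must refer to the same task.
            problems.append(f"{stage} refers to {record.get('task_id')!r}")
    return problems


# Usage: a complete chain passes, a chain without a capture bundle does not.
complete = {stage: {"task_id": "settings-toggle"} for stage in REQUIRED_STAGES}
incomplete = {k: v for k, v in complete.items() if k != "capture_bundle"}

print(finish_gate("settings-toggle", complete))    # []
print(finish_gate("settings-toggle", incomplete))  # ['missing capture_bundle']
```

Returning a list of problems instead of a boolean keeps the gate's output reviewable, which matches the goal of a diagnostic evidence chain.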

Primary Evidence Links

These files are the source of truth behind the visual summary.