GUI Agent Benchmark Dashboard

Why This Is Resume-Relevant

The project is strongest as an agent reliability artifact: it shows how to turn ambiguous GUI-agent failures into a reviewable evidence chain with validators and honest scope boundaries.

Deterministic Benchmark

Local browser tasks are scored against explicit judge criteria, which avoids subjective "looks successful" evaluations.

Evidence Chain

Each result links back to capture bundles, summaries, step traces, taxonomy, and finish gates.

Bounded Claims

The README and reports explicitly separate qualitative failure analysis from model ranking claims.

Task Matrix

Scores come from `experiments/2026-05-24-uitars-expanded-real-round/real-run-summary.json`.

Task	Score	Main failed primitive
`onboarding-form`	0.33	Text-entry continuation and submit
`catalog-filter`	0.00	Filter/search to selected item commit
`settings-toggle`	0.75	Dropdown value commit
`ticket-review`	0.00	Table search, selection, and review commit
`modal-confirmation`	0.00	Modal open and confirm sequence
`pagination-review`	0.33	Page navigation and row action
`sortable-inventory`	0.00	Sort commit and row selection
`multi-select-approvals`	0.00	Multi-select and submit
`validation-error-recovery`	0.40	Validation recovery after error
`file-upload-request`	0.25	Upload form dropdown and submit
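
As a reading aid, the matrix above can be summarized in a few lines of Python. This is a minimal sketch, not part of the project: the scores are copied by hand from the table rather than parsed from `real-run-summary.json` (its schema is not shown here), and the names `TASK_SCORES` and `summarize` are hypothetical.

```python
from statistics import mean

# Scores copied by hand from the task matrix (not loaded from the JSON file).
TASK_SCORES = {
    "onboarding-form": 0.33,
    "catalog-filter": 0.00,
    "settings-toggle": 0.75,
    "ticket-review": 0.00,
    "modal-confirmation": 0.00,
    "pagination-review": 0.33,
    "sortable-inventory": 0.00,
    "multi-select-approvals": 0.00,
    "validation-error-recovery": 0.40,
    "file-upload-request": 0.25,
}


def summarize(scores):
    """Return (mean score, zero-score tasks, highest-scoring task)."""
    zero_tasks = sorted(task for task, score in scores.items() if score == 0.0)
    best_task = max(scores, key=scores.get)
    return round(mean(scores.values()), 3), zero_tasks, best_task


avg, zero_tasks, best_task = summarize(TASK_SCORES)
print(f"mean score: {avg}")
print(f"zero-score tasks ({len(zero_tasks)}): {', '.join(zero_tasks)}")
print(f"highest score: {best_task}")
```

The mean only describes this one round. In keeping with the bounded claims above, it is a summary of failure coverage, not a model ranking.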

Failure Primitive Distribution

Primary failure codes are mapped in the expanded failure taxonomy artifact.

ACT-DROPDOWN-VALUE-MISS: 2 tasks

ACT-TEXT-ENTRY-STALL: 1 task

ACT-SELECTION-COMMIT-MISS: 1 task

ACT-TABLE-SEARCH-LOOP: 1 task

ACT-MODAL-CONFIRMATION-MISS: 1 task

ACT-PAGINATION-NAV-MISS: 1 task

ACT-SORT-COMMIT-MISS: 1 task

ACT-MULTI-SELECT-MISS: 1 task

ACT-VALIDATION-RECOVERY-PARTIAL: 1 task
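
One sanity check on this distribution is that the per-code counts add up to the number of rows in the task matrix. The sketch below assumes each task has exactly one primary failure code. The counts are copied by hand from the list above, and `check_distribution` is a hypothetical helper, not a function from the project.

```python
# Counts copied by hand from the distribution above.
PRIMARY_FAILURE_COUNTS = {
    "ACT-DROPDOWN-VALUE-MISS": 2,
    "ACT-TEXT-ENTRY-STALL": 1,
    "ACT-SELECTION-COMMIT-MISS": 1,
    "ACT-TABLE-SEARCH-LOOP": 1,
    "ACT-MODAL-CONFIRMATION-MISS": 1,
    "ACT-PAGINATION-NAV-MISS": 1,
    "ACT-SORT-COMMIT-MISS": 1,
    "ACT-MULTI-SELECT-MISS": 1,
    "ACT-VALIDATION-RECOVERY-PARTIAL": 1,
}
TASKS_IN_MATRIX = 10  # number of rows in the task matrix


def check_distribution(counts, expected_total):
    """Raise if the per-code counts do not cover exactly the expected tasks."""
    total = sum(counts.values())
    if total != expected_total:
        raise ValueError(f"codes cover {total} tasks, expected {expected_total}")
    return total


total = check_distribution(PRIMARY_FAILURE_COUNTS, TASKS_IN_MATRIX)
print(f"{len(PRIMARY_FAILURE_COUNTS)} primary codes cover {total} tasks")
```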

Evidence Chain

A task is not treated as closed just because a model run happened; the artifacts need to agree.

Deterministic task

UI-TARS attempt

Preflight target check

Capture bundle

Real-run summary

Step trace + taxonomy

Finish gate
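
The rule that "the artifacts need to agree" can be pictured as a small gate function. This is a sketch under stated assumptions, not the project's actual finish gate: the artifact names, the `task_id` and `primary_failure` fields, and the `finish_gate` function are all hypothetical stand-ins for the stages listed above.

```python
from dataclasses import dataclass, field

# Hypothetical artifact names, one per stage that follows the task definition.
REQUIRED_ARTIFACTS = (
    "preflight_target_check",
    "capture_bundle",
    "real_run_summary",
    "step_trace",
    "taxonomy",
)


@dataclass
class GateResult:
    closed: bool
    problems: list = field(default_factory=list)


def finish_gate(task_id, artifacts):
    """Close a task only if every artifact exists and refers to the same task.

    `artifacts` maps an artifact name to a dict with a 'task_id' key
    (an assumed shape, chosen for illustration).
    """
    problems = []
    for name in REQUIRED_ARTIFACTS:
        artifact = artifacts.get(name)
        if artifact is None:
            problems.append(f"missing {name}")
        elif artifact.get("task_id") != task_id:
            problems.append(f"{name} refers to a different task")
    # Agreement check: summary and taxonomy must name the same primary failure.
    summary = artifacts.get("real_run_summary")
    taxonomy = artifacts.get("taxonomy")
    if summary and taxonomy:
        if summary.get("primary_failure") != taxonomy.get("primary_failure"):
            problems.append("summary and taxonomy disagree on primary_failure")
    return GateResult(closed=not problems, problems=problems)


# Placeholder task id and failure code, not real benchmark data.
complete = {
    name: {"task_id": "example-task", "primary_failure": "EXAMPLE-CODE"}
    for name in REQUIRED_ARTIFACTS
}
partial = {k: v for k, v in complete.items() if k != "capture_bundle"}

print(finish_gate("example-task", complete).closed)   # True
print(finish_gate("example-task", partial).problems)  # ['missing capture_bundle']
```

Returning the list of problems rather than a bare boolean keeps the gate reviewable: a task that stays open carries the reason with it.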

Primary Evidence Links

These files are the source of truth behind the visual summary.

Failure-analysis benchmark for browser GUI agents

Expanded Round Task Scores
