Why This Is Resume-Relevant
The project is strongest as an agent reliability artifact: it shows how to turn ambiguous GUI-agent failures into a reviewable evidence chain with validators and honest scope boundaries.
Deterministic Benchmark
Local browser tasks and judge criteria avoid subjective "looks successful" evaluations.
Evidence Chain
Each result links back to capture bundles, summaries, step traces, taxonomy, and finish gates.
Bounded Claims
The README and reports explicitly separate qualitative failure analysis from model ranking claims.
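To make the "Deterministic Benchmark" point concrete, here is a minimal sketch of a judge that scores a run as the fraction of explicit criteria satisfied by the final page state. The `judge` function, the state keys, and the criteria are all hypothetical illustrations; the project's actual judge and criteria format are not shown in this summary.

```python
from typing import Callable, Dict, Mapping

# A criterion is a named predicate over the final page state.
# Both the state shape and the criteria below are assumptions for illustration.
Criterion = Callable[[Mapping[str, object]], bool]


def judge(state: Mapping[str, object], criteria: Dict[str, Criterion]) -> float:
    """Return the fraction of criteria that pass, rounded to two decimals."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    passed = sum(1 for check in criteria.values() if check(state))
    return round(passed / len(criteria), 2)


# Hypothetical criteria for an example form task (not one of the real tasks).
EXAMPLE_CRITERIA: Dict[str, Criterion] = {
    "name_filled": lambda s: s.get("name") == "Ada",
    "role_selected": lambda s: s.get("role") == "admin",
    "form_submitted": lambda s: s.get("submitted") is True,
}

if __name__ == "__main__":
    final_state = {"name": "Ada", "role": "admin", "submitted": False}
    print(judge(final_state, EXAMPLE_CRITERIA))
```

Because each criterion is a fixed predicate, the same final state always yields the same score, which is the property that replaces a subjective "looks successful" call.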
Task Matrix
Scores come from experiments/2026-05-24-uitars-expanded-real-round/real-run-summary.json.
| Task | Score | Main failed primitive |
|---|---|---|
| onboarding-form | 0.33 | Text-entry continuation and submit |
| catalog-filter | 0.00 | Filter/search to selected item commit |
| settings-toggle | 0.75 | Dropdown value commit |
| ticket-review | 0.00 | Table search, selection, and review commit |
| modal-confirmation | 0.00 | Modal open and confirm sequence |
| pagination-review | 0.33 | Page navigation and row action |
| sortable-inventory | 0.00 | Sort commit and row selection |
| multi-select-approvals | 0.00 | Multi-select and submit |
| validation-error-recovery | 0.40 | Validation recovery after error |
| file-upload-request | 0.25 | Upload form dropdown and submit |
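The task matrix can be summarized with a few lines of Python. This is a sketch only: the scores are copied by hand from the table above rather than read from real-run-summary.json, because that file's schema is not shown here, and the `summarize` helper is a hypothetical name.

```python
from statistics import mean

# Scores copied from the task matrix; not parsed from real-run-summary.json.
SCORES = {
    "onboarding-form": 0.33,
    "catalog-filter": 0.00,
    "settings-toggle": 0.75,
    "ticket-review": 0.00,
    "modal-confirmation": 0.00,
    "pagination-review": 0.33,
    "sortable-inventory": 0.00,
    "multi-select-approvals": 0.00,
    "validation-error-recovery": 0.40,
    "file-upload-request": 0.25,
}


def summarize(scores):
    """Return task count, mean score, zero-score tasks, and the best task."""
    return {
        "tasks": len(scores),
        "mean": round(mean(scores.values()), 3),
        "zero_score_tasks": sorted(t for t, s in scores.items() if s == 0.0),
        "best_task": max(scores, key=scores.get),
    }


if __name__ == "__main__":
    for key, value in summarize(SCORES).items():
        print(f"{key}: {value}")
```

On these ten rows, five tasks score 0.00 and the mean is 0.206, which is consistent with reading the matrix as a failure analysis rather than a ranking.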
Failure Primitive Distribution
Primary failure codes are mapped in the expanded failure taxonomy artifact.
Evidence Chain
A task is not treated as closed just because a model run happened; the artifacts need to agree.
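One way to picture "the artifacts need to agree" is a closure check that refuses to mark a task closed unless every expected artifact is present, refers to the same task, and reports a consistent score. The sketch below is an assumption-laden illustration: the artifact kinds are taken from the evidence-chain description, while the `Artifact` type, `closure_problems`, and `is_closed` are hypothetical names, not the project's actual validators.

```python
from dataclasses import dataclass
from typing import List, Optional

# Artifact kinds named in the evidence-chain description; the real file
# layout and schemas are assumptions here.
REQUIRED_KINDS = ("capture_bundle", "summary", "step_trace", "taxonomy", "finish_gate")


@dataclass
class Artifact:
    kind: str
    task_id: str
    score: Optional[float] = None  # only some artifacts carry a score


def closure_problems(task_id: str, artifacts: List[Artifact]) -> List[str]:
    """List every reason the task cannot be treated as closed."""
    problems = []
    by_kind = {a.kind: a for a in artifacts}
    for kind in REQUIRED_KINDS:
        if kind not in by_kind:
            problems.append(f"missing {kind}")
        elif by_kind[kind].task_id != task_id:
            problems.append(f"{kind} refers to {by_kind[kind].task_id}")
    scores = {a.score for a in artifacts if a.score is not None}
    if len(scores) > 1:
        problems.append(f"conflicting scores: {sorted(scores)}")
    return problems


def is_closed(task_id: str, artifacts: List[Artifact]) -> bool:
    """A task is closed only when no disagreement or gap is found."""
    return not closure_problems(task_id, artifacts)
```

Returning the list of problems, rather than a bare boolean, keeps the check reviewable: a failed gate says which artifact is missing or inconsistent.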
Primary Evidence Links
These files are the source of truth behind the visual summary.