- New benchmark uncovers hidden weaknesses in web‑automation agents.
Researchers released WebStep, a suite of 1,800 web‑task instances that tracks a deterministic semantic state behind each GUI. Because the benchmark logs these high‑level states automatically, researchers can compare agents step by step instead of judging only final success. Three agents (OpenAI CUA, Qwen 3.5 and a third unnamed model) finished with overall success rates of 31‑33 %, but their process metrics diverged sharply. On a housing‑search task, OpenAI CUA reached 23.7 % more commit actions than Qwen 3.5, yet lagged by 15.6 % on filtering actions.
Why it matters: process data pinpoints exactly which sub‑tasks need work, something outcome‑only scores cannot do. The study also shows that the gap between agents widens as task difficulty rises, suggesting that current models may break down under realistic exploration demands. Developers now have a concrete target: improve the specific skill that causes a bifurcation error, rather than chase a vague improvement in overall accuracy.
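To illustrate why outcome‑only scores can hide skill‑level differences, here is a minimal Python sketch. It assumes a hypothetical step log of `(agent, task, skill, succeeded)` tuples; the agent names, skill labels and values are invented for illustration and are not WebStep data or the benchmark's actual log format.

```python
from collections import defaultdict

# Hypothetical step log: (agent, task_id, skill, step_succeeded).
# Every name and value here is invented for illustration only.
STEPS = [
    ("agent_a", "t1", "filter", True),
    ("agent_a", "t1", "commit", True),
    ("agent_a", "t2", "filter", False),
    ("agent_a", "t2", "commit", True),
    ("agent_b", "t1", "filter", True),
    ("agent_b", "t1", "commit", False),
    ("agent_b", "t2", "filter", True),
    ("agent_b", "t2", "commit", True),
]


def task_success(steps):
    """Return {agent: fraction of tasks in which every logged step succeeded}."""
    per_task = defaultdict(lambda: defaultdict(lambda: True))
    for agent, task, _skill, ok in steps:
        per_task[agent][task] = per_task[agent][task] and ok
    return {a: sum(tasks.values()) / len(tasks) for a, tasks in per_task.items()}


def skill_rates(steps):
    """Return {agent: {skill: fraction of that skill's steps that succeeded}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [successes, total]
    for agent, _task, skill, ok in steps:
        cell = counts[agent][skill]
        cell[0] += int(ok)
        cell[1] += 1
    return {
        agent: {skill: succ / total for skill, (succ, total) in skills.items()}
        for agent, skills in counts.items()
    }


if __name__ == "__main__":
    print("overall:", task_success(STEPS))
    print("by skill:", skill_rates(STEPS))
```

In this toy log both agents have the same overall task success, yet one fails on filtering steps and the other on commit steps. That is the kind of divergence a per‑step metric exposes and a single headline number does not.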
The takeaway is modest: WebStep offers a scalable way to audit web agents beyond headline success percentages, giving the community a means to pursue incremental, skill‑level gains rather than marginal improvements in overall scores.