AI shopping agents can't actually shop — at least not well, according to a new benchmark designed to find out.
Researchers introduced EComAgentBench, a set of 662 tasks built on real Amazon products and reviews, to test how well shopping agents built on large language models handle the messy way real buyers communicate. Instead of handing an agent a clean, complete request, the benchmark scatters requirements across three places: a visible query, a tool-gated profile, and scripted clarification exchanges. The design mimics how a shopper might state one thing, imply another, and reveal a third only when asked. Agents must resolve all of it and commit to a single product within 100 tool calls. The team evaluated seven models; the best reached only 57.1% overall accuracy, and performance dropped further when requirements were hidden rather than stated upfront.
Most existing shopping-agent benchmarks hand over full intent at the start and score only the final pick, a setup that masks exactly where and why an agent fails. EComAgentBench's rubrics are source-tagged: each failure is attributed to a specific requirement and to the place where that requirement was buried (the query, the profile, or a clarification exchange). That granularity matters because it shifts the research question from "did the agent get it right" to "which part of the buyer's intent did it miss and why."
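To make the idea of source-tagged scoring concrete, here is a minimal Python sketch. It is not the benchmark's actual schema or scoring code: the `Requirement` class, the `attribute_failures` function, the source labels, and the sample requirements are all hypothetical names chosen for illustration, assuming only what the description above states (each requirement comes from the query, the profile, or a clarification exchange).

```python
from dataclasses import dataclass

# Hypothetical labels for the three places a requirement can come from.
SOURCES = ("query", "profile", "clarification")


@dataclass(frozen=True)
class Requirement:
    name: str    # illustrative identifier, e.g. "under_50_usd"
    source: str  # one of SOURCES

    def __post_init__(self):
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")


def attribute_failures(requirements, satisfied):
    """Count unmet requirements per source.

    `satisfied` is the set of requirement names the chosen product meets.
    """
    misses = {source: 0 for source in SOURCES}
    for req in requirements:
        if req.name not in satisfied:
            misses[req.source] += 1
    return misses


def task_passed(requirements, satisfied):
    """A pick counts as correct only if every requirement is met."""
    return all(req.name in satisfied for req in requirements)


# Made-up task: one requirement from each source.
task = [
    Requirement("wireless", "query"),
    Requirement("fits_small_ears", "profile"),
    Requirement("under_50_usd", "clarification"),
]

# Suppose the agent's pick only satisfies the requirement stated in the query.
report = attribute_failures(task, satisfied={"wireless"})
print(report)
print(task_passed(task, satisfied={"wireless"}))
```

A pass/fail score alone would record this pick as a single failure. The per-source tally shows instead that the agent handled the stated request and missed both hidden requirements, which is the kind of diagnosis the source-tagged rubrics are meant to support.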
A top score of 57.1% is a useful reality check for anyone watching retailers rush to deploy AI assistants: the gap between a working product demo and a dependable shopping agent is apparently still wide.
