RetailBench Tests Whether LLM Agents Can Run a Store

A new benchmark tests whether LLM agents can operate a supermarket for six months without going under.

Researchers introduced RetailBench, a simulation that models single-store supermarket operations as a partially observable decision process. Agents must simultaneously manage pricing, inventory replenishment, supplier selection, shelf layout, aging stock, customer feedback, and cash-flow constraints. Seven contemporary LLMs were evaluated over a 180-day run, although the benchmark is built to support much longer simulations of up to a thousand days. Only a small subset of the models survived the full evaluation horizon, and even the strongest performers fell well short of an oracle policy in net worth and sales outcomes.

Most AI agent benchmarks test short, well-scoped tasks in which the effects of a bad decision disappear after a few steps. RetailBench forces sustained, interdependent choices: a bad pricing call in week two can compound into an inventory crisis by month four. Behavioral analysis attributes the models' shortfall to incomplete information gathering, surface-level reasoning, and the absence of a consistent long-horizon strategy.

That list of failure modes maps closely onto what most enterprise AI deployments run into in the real world, which is either encouraging because researchers are finally measuring the right things, or unsettling because the best available models still cannot manage a corner store.
