A new research framework exposes how far GUI agents still have to go before they can reliably operate a smartphone like a human.
ScaleWoB is a framework that generates synthetic, interactive web environments for testing and training GUI agents based on large language models. Rather than relying on virtual machines or Docker containers, its environments run as backend-free webpages, which keeps setup cost and resource overhead low. The system covers mobile, desktop, and automotive interfaces through a single pipeline, producing over 1,000 verifiable tasks across 100-plus environments. The researchers also released a dedicated mobile benchmark: 120 tasks spanning 63 simulated apps.
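To make "verifiable task" concrete, here is a minimal Python sketch of the general idea: if an environment keeps all of its state locally (as a backend-free page would), success can be decided by a simple check on the final state. This is an illustration only and not ScaleWoB's actual code or API; the names `Task`, `ToyEnvironment`, `run_episode`, and the alarm-app actions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]


@dataclass
class Task:
    """A hypothetical verifiable task: an instruction plus a success check."""
    instruction: str
    check: Callable[[State], bool]  # returns True if the final state meets the goal


@dataclass
class ToyEnvironment:
    """Stand-in for a self-contained simulated app; all state lives in memory."""
    state: State = field(
        default_factory=lambda: {"alarm_enabled": False, "alarm_time": None}
    )

    def apply(self, action: str, value: Optional[object] = None) -> None:
        # Each action mutates local state only; there is no server to call.
        if action == "set_alarm_time":
            self.state["alarm_time"] = value
        elif action == "toggle_alarm":
            self.state["alarm_enabled"] = not self.state["alarm_enabled"]
        else:
            raise ValueError(f"unknown action: {action}")


def run_episode(task: Task, actions: List[Tuple[str, Optional[object]]]) -> bool:
    """Replay an agent's actions in a fresh environment, then verify the outcome."""
    env = ToyEnvironment()
    for action, value in actions:
        env.apply(action, value)
    return task.check(env.state)


# Hypothetical example task for the toy alarm app.
alarm_task = Task(
    instruction="Set an alarm for 07:30 and turn it on.",
    check=lambda s: s["alarm_enabled"] is True and s["alarm_time"] == "07:30",
)
```

Replaying `[("set_alarm_time", "07:30"), ("toggle_alarm", None)]` through `run_episode(alarm_task, ...)` passes the check, while setting the time without enabling the alarm does not. The point of the design is that the reward comes from inspecting state rather than from a human or a model judging screenshots.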
The results make the hype around autonomous agents look premature. Five state-of-the-art mobile GUI agents averaged a 27.92% success rate on the benchmark — dropping to 17.82% on longer, multi-step tasks. Humans cleared 92.08%. The researchers also checked whether synthetic results generalize to real apps, and found they do, which makes the gap harder to dismiss as a benchmark artifact.
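For a sense of scale, the reported figures can be turned into gaps with a few lines of Python. The three percentages come from the article; the helper names are illustrative, and the derived values are plain arithmetic on those figures, not additional results from the researchers.

```python
# Success rates reported in the article, in percent.
AGENT_AVG = 27.92    # average of five mobile GUI agents on the benchmark
AGENT_LONG = 17.82   # same agents on longer, multi-step tasks
HUMAN = 92.08        # human success rate


def gap_in_points(human: float, agent: float) -> float:
    """Absolute gap in percentage points."""
    return round(human - agent, 2)


def relative_drop(base: float, lower: float) -> float:
    """Relative decline from base to lower, as a percentage of base."""
    return round((base - lower) / base * 100, 1)


overall_gap = gap_in_points(HUMAN, AGENT_AVG)       # 64.16 points
long_task_gap = gap_in_points(HUMAN, AGENT_LONG)    # 74.26 points
long_task_drop = relative_drop(AGENT_AVG, AGENT_LONG)  # 36.2 percent

print(f"Human vs. agent gap: {overall_gap} points")
print(f"Gap on long tasks: {long_task_gap} points")
print(f"Relative drop on long tasks: {long_task_drop}%")
```

In other words, agents trail humans by roughly 64 percentage points overall, and their success rate on longer tasks is about a third lower than their own average.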
Most existing GUI benchmarks are confined to open-source apps or simple file operations because verifiable rewards are hard to build in messy real-world environments. ScaleWoB sidesteps that by synthesizing the environments themselves — a practical tradeoff, though one that will inevitably invite questions about what gets lost when you replace the real world with a simulation.