Seeing and doing, it turns out, are two very different problems for AI.
Researchers introduced ROSE (Reference-conditioned Oddity and Symbolic Execution), a benchmark designed to isolate whether multimodal large language models can translate the same visual scene into the right action when the task context changes. The setup is deliberately controlled: the image stays fixed while the required output shifts between counting objects and executing coordinate-based actions within a constrained region. Nine recent multimodal models were tested. Humans scored 98.8% across tasks. The models did not.
The gap matters because the AI industry is racing to deploy vision-capable agents: systems that don't just describe images but act on them by clicking, selecting, and navigating. ROSE exposes a specific failure mode that benchmarks focused on visual question answering or image captioning would miss entirely: a model can correctly count objects in a scene and still fail to act on that same information when a region constraint and a symbolic output are added. Performance dropped by as much as 44.5 percentage points between counting tasks and region-conditioned action tasks.
What makes the finding harder to hand-wave away is that the gap persists even on paired scenes where the model already got the count right, meaning the model had the visual evidence and still couldn't convert it into the correct action. The researchers found that coordinate grounding explains only part of the loss, pointing to a separate, model-dependent bottleneck. In other words, each model breaks differently, which rules out any one-size-fits-all fix. The broader implication: benchmarks that test perception alone are telling labs what they want to hear.
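To make the paired comparison concrete, here is a minimal sketch of how a counting-versus-action gap could be computed, along with action accuracy restricted to scenes where the count was already right. This is not the ROSE authors' evaluation code: the `PairedResult` record, its field names, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """One scene scored on both tasks (hypothetical record layout)."""
    scene_id: str
    count_correct: bool   # did the model get the counting task right?
    action_correct: bool  # did it get the region-conditioned action right?


def accuracy(flags):
    """Fraction of True values; NaN if there is nothing to score."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def paired_gap(results):
    """Summarize the count-to-action gap over paired scenes."""
    count_acc = accuracy(r.count_correct for r in results)
    action_acc = accuracy(r.action_correct for r in results)
    # Keep only scenes where the count was right, then ask how often
    # the action on that same scene was also right.
    conditional = accuracy(r.action_correct for r in results if r.count_correct)
    return {
        "count_acc": count_acc,
        "action_acc": action_acc,
        "gap_pp": 100 * (count_acc - action_acc),  # percentage points
        "action_acc_given_count_correct": conditional,
    }


# Made-up records, not benchmark data.
results = [
    PairedResult("s1", True, True),
    PairedResult("s2", True, False),
    PairedResult("s3", True, False),
    PairedResult("s4", False, False),
]
summary = paired_gap(results)
print(summary)
```

If the conditional accuracy stays low, the model is failing at the conversion from evidence to action rather than at perception, which is the pattern the article describes.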