vision-language / robotics · occlusion

LIBERO-Occ benchmark reveals VLA weakness to occlusion, VIM offers fix

New benchmark shows vision-language-action models drop sharply when objects are hidden, and a view‑generation trick narrows the gap.

A new benchmark called LIBERO-Occ tests vision‑language‑action (VLA) systems on tasks where objects are partly blocked from view.

The authors extended the existing LIBERO suite with scenarios that deliberately hide task‑relevant items. Across several state‑of‑the‑art VLA models, performance fell by up to 30% compared with clear‑view baselines. To counter this, they introduced Viewpoint Imagination (VIM), a module that synthesizes an alternative camera angle from the single observed frame and feeds both the real and the imagined view to the action predictor. VIM restored most of the lost performance without needing extra hardware at test time.
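As a rough illustration of that data flow, the sketch below pairs one observed frame with a generated second view and passes both to an action predictor. It is not the authors' implementation: `imagine_view`, `predict_action`, and the 7‑value action vector are hypothetical placeholders, and the "imagined" view here is just a mirrored copy of the frame so the example runs without a trained model.

```python
import numpy as np


def imagine_view(frame: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a learned view generator.

    A real module would synthesize a new camera angle; this placeholder
    only mirrors the image horizontally so the sketch is runnable.
    """
    return frame[:, ::-1, :].copy()


def predict_action(views: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a VLA action head.

    Takes stacked views of shape (num_views, H, W, 3) and returns a
    7-value action vector (the size 7 is an assumption for illustration).
    """
    per_view_feature = views.mean(axis=(1, 2, 3))  # one scalar per view
    return np.full(7, per_view_feature.mean())


def act_with_imagined_view(frame: np.ndarray) -> np.ndarray:
    """Feed both the real frame and the imagined view to the predictor."""
    imagined = imagine_view(frame)
    views = np.stack([frame, imagined], axis=0)  # shape (2, H, W, 3)
    return predict_action(views)


# Usage: a single camera frame in, one action vector out.
frame = np.zeros((64, 64, 3), dtype=np.float32)
action = act_with_imagined_view(frame)
```

The point of the structure is that the second view is produced in software from the same frame, so the robot still needs only one camera at test time.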

The result matters because real‑world robots rarely enjoy an unobstructed line of sight; shelves, hands, and other objects constantly block what they need to see. Demonstrating a software‑only way to fill in missing visual information makes VLA models more viable for warehouse picking, home assistance, and similar tasks where adding cameras is costly.

The work is a reminder that high benchmark scores can mask brittleness: a model tested only under ideal lighting and camera angles may be hiding its most serious flaw.

The Revision

Written by an AI system from the public sources credited above.