Teaching Robots to Fetch Your Cup, Not Just a Cup

Robots can follow generic commands fine — the hard part is knowing your mug.

Researchers have published a method called Visual Attentive Prompting (VAP) that lets existing Vision-Language-Action models pick out a specific object a user owns, even when that object looks nearly identical to others nearby. The system takes a handful of reference images of the personal item, uses open-vocabulary detection to locate it in the scene, matches it against those images via embeddings, and then highlights the target and rewrites the robot's instruction on the fly. Crucially, VAP requires no retraining — it wraps around a frozen model as a lightweight adapter.
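
To make the pipeline concrete, here is a minimal sketch of the matching and instruction-rewriting steps described above. It is an illustration under stated assumptions, not the researchers' implementation: the `Detection` type, the function names, the 0.5 similarity threshold, and the "highlighted" wording are all hypothetical, and the embeddings are assumed to come from whatever detector and image encoder a real system would use.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    """One candidate region from an open-vocabulary detector (hypothetical type)."""
    box: tuple        # (x1, y1, x2, y2) in pixels
    embedding: list   # feature vector for the cropped region


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_personal_object(detections, reference_embeddings, threshold=0.5):
    """Return the detection most similar to any reference image, or None.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    if not reference_embeddings:
        return None
    best, best_score = None, threshold
    for det in detections:
        # Score each candidate by its closest reference image.
        score = max(cosine(det.embedding, ref) for ref in reference_embeddings)
        if score > best_score:
            best, best_score = det, score
    return best


def rewrite_instruction(instruction, personal_phrase, category):
    """Swap the personal reference for a phrase pointing at the visual highlight."""
    return instruction.replace(personal_phrase, f"the highlighted {category}")


# Two look-alike cups in the scene; toy 3-D vectors stand in for real embeddings.
scene = [
    Detection(box=(10, 10, 50, 50), embedding=[1.0, 0.0, 0.2]),
    Detection(box=(60, 10, 100, 50), embedding=[0.1, 1.0, 0.0]),
]
references = [[0.9, 0.1, 0.25], [1.0, 0.05, 0.1]]

target = pick_personal_object(scene, references)
prompt = rewrite_instruction("bring my cup", "my cup", "cup")
```

In this toy scene the first cup is selected because its vector points in nearly the same direction as both references. A real system would then draw the highlight on `target.box` in the camera image and pass that image, together with the rewritten prompt, to the frozen model.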

The gap it closes matters: most robot AI today handles categories, not instances. Telling a robot to "bring a cup" works; telling it to "bring my cup" has been a reliable way to watch it grab the wrong one. Across two simulation suites and a real-world tabletop test, VAP shows consistent gains over both generic policies and token-learning alternatives in task success rate and correct-object selection.

The approach leans on techniques already common in vision AI (embedding-based matching, visual prompting) rather than inventing a new architecture, which is either elegantly pragmatic or a sign the underlying VLA models still can't do basic instance recognition on their own, depending on how charitable you feel.
