A research framework lets robot AI models figure out what they can't do — then go learn it on their own.
InSight, detailed in a new paper, targets a core limitation of vision-language-action (VLA) models: they can only do what their training data showed them. The system works in two stages. First, it breaks existing demonstrations into labeled primitive actions, discrete moves like "lift upward" or "pour the bottle," which makes the underlying model steerable at that granular level. Second, a vision-language model identifies the primitives a new task requires but the training data lacks; the system then attempts those moves autonomously and feeds only the successful runs back into the training set. The robot is, in effect, curating its own curriculum.
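To make the two-stage loop concrete, here is a minimal Python sketch of the idea as described above. It is not the paper's implementation. Every name in it (`Rollout`, `Dataset`, `missing_primitives`, `self_improve`, `make_simulated_attempt`) is hypothetical, the vision-language model is replaced by a plain set comparison, and the robot is replaced by a simulated attempt that succeeds at an assumed rate.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Rollout:
    primitive: str   # label such as "lift upward"
    success: bool    # whether the attempt achieved the primitive


@dataclass
class Dataset:
    # Stage 1 output: demonstrations already segmented into labeled primitives.
    rollouts: list = field(default_factory=list)

    def known_primitives(self):
        return {r.primitive for r in self.rollouts}


def missing_primitives(required, dataset):
    """Stand-in for the VLM step: which required primitives have no data yet?"""
    known = dataset.known_primitives()
    return [p for p in required if p not in known]


def self_improve(required, dataset, attempt, tries_per_primitive=5):
    """Stage 2: attempt each missing primitive, keep only successful rollouts."""
    for primitive in missing_primitives(required, dataset):
        for _ in range(tries_per_primitive):
            rollout = attempt(primitive)
            if rollout.success:  # failed attempts never enter the training set
                dataset.rollouts.append(rollout)
    return dataset


def make_simulated_attempt(success_rate=0.5, seed=0):
    """Hypothetical robot: each attempt succeeds with an assumed probability."""
    rng = random.Random(seed)

    def attempt(primitive):
        return Rollout(primitive, rng.random() < success_rate)

    return attempt


if __name__ == "__main__":
    data = Dataset([Rollout("lift upward", True)])
    task = ["lift upward", "pour the bottle"]
    print("missing before:", missing_primitives(task, data))
    self_improve(task, data, make_simulated_attempt())
    print("missing after:", missing_primitives(task, data))
```

The filter on `rollout.success` is the part that matters: because only successful attempts are added, the primitive set can grow without human demonstrations while failed attempts stay out of the training data. In a real system, judging success would itself require a learned or scripted checker, which this sketch simply assumes.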
The significance here isn't any single task, though the researchers tested block flipping, drawer closing, sweeping, twisting, and pouring without a single human demonstration of those target skills. What matters is that the loop is closed: the model identifies its own blind spots, patches them, and the fixes compound over time. Most robot learning systems still require humans to intervene every time a new skill is needed; InSight is a bet that this bottleneck can be automated away.
The approach borrows from ideas that have circulated in reinforcement learning for years — self-play, automated curriculum generation — but applies them to the more constrained world of physical manipulation, where a failed attempt means a tipped bowl, not a lost game. Whether it scales beyond the lab's test tasks is the question that matters, and the paper doesn't fully answer it yet.
