Robots Learn to Grab by Watching Human Hands

Robots still can't grab a coffee mug as casually as you can — but a new model called HUG is closing that gap by mining data from the one place human grasping is plentiful: human hands.

Researchers built HUG, a flow-matching model that predicts how a hand should grip any object visible in a single depth-enabled image. To train it, they strapped smart glasses to people walking through 41 buildings and recorded 27.8 hours of everyday grasping across 6,707 distinct objects — a dataset they call 1M-HUGs. The model takes RGB and depth data from a stereo camera, then outputs the wrist position, wrist rotation, and full hand pose needed to make the grab. Those predicted grasps can be translated to different robot hands, letting a robot attempt objects it has never seen before — what the field calls zero-shot transfer.
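For readers curious what "flow matching" means in practice, here is a minimal sketch of the sampling idea: start from random noise and nudge it, step by step, along a velocity field until it becomes a grasp. Everything specific in it is an assumption for illustration, not HUG's actual design. The article does not say how the model encodes rotation or hand pose, so the quaternion, the 15 joint values, the names `Grasp`, `sample_grasp`, and `toy_velocity`, and the plain Euler integrator are all hypothetical. In the real system the velocity field would be a trained network conditioned on the RGB and depth input; here a toy field that points at a known target stands in for it.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Assumed layout (not from the article): 3 wrist position values,
# 4 wrist rotation values (a quaternion), 15 finger joint values.
POSE_DIM = 15
GRASP_DIM = 3 + 4 + POSE_DIM

Velocity = Callable[[List[float], float], List[float]]


@dataclass
class Grasp:
    wrist_position: List[float]  # x, y, z
    wrist_rotation: List[float]  # quaternion, assumed encoding
    hand_pose: List[float]       # joint values, assumed encoding


def sample_grasp(velocity: Velocity, steps: int = 10, seed: int = 0) -> Grasp:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(GRASP_DIM)]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return Grasp(x[:3], x[3:7], x[7:])


def toy_velocity(target: List[float]) -> Velocity:
    """Stand-in for a trained network: a straight-line flow toward a fixed target."""
    def velocity(x: List[float], t: float) -> List[float]:
        # On a straight path from noise to target, the remaining
        # displacement is covered in the remaining time (1 - t).
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return velocity


if __name__ == "__main__":
    target = [0.1, 0.2, 0.3] + [1.0, 0.0, 0.0, 0.0] + [0.5] * POSE_DIM
    grasp = sample_grasp(toy_velocity(target))
    print(grasp.wrist_position)
```

With the toy field, the sampler lands on the target grasp regardless of the starting noise. A learned field would instead steer different noise samples toward different plausible grasps for the object in view.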

The gap between human and robot dexterity has long been a hard ceiling on automation. Most grasping systems rely on either controlled datasets with limited object variety or expensive simulation pipelines that struggle to reflect how people actually handle things. Pulling egocentric video from smart glasses is a practical shortcut: the data is cheap to collect, naturally diverse, and grounded in how objects get used in the real world.

HUG beat existing baselines by 23% and 34% on its own 90-object benchmark, HUG-Bench — though benchmarks designed by the same team that built the model are worth treating with mild skepticism until independent groups replicate the results.
