Researchers found that large robot-control AI models can lose up to half their layers and, by the paper's measurements, still perform as well as the originals.
Vision-language-action models, the foundation models that teach robots to manipulate objects, are typically multi-billion-parameter systems trained on massive video and sensor datasets. A new paper shows that models like pi_0 and GR00T-N1.5 contain significant layer-wise redundancy despite that training breadth. The researchers built a compression pipeline that requires no additional training: a single forward pass using Centered Kernel Alignment (CKA), a similarity measure for comparing the internal representations of different layers, identifies layers that add little beyond their neighbors, and those layers are then permanently removed. The result is a model up to 50% shallower, with 40-50% faster fine-tuning and up to 30% faster real-time inference.
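To make the idea of CKA-based redundancy detection concrete, here is a minimal sketch using the standard linear CKA formula. It is not the paper's pipeline: the function names `linear_cka` and `redundant_layers`, the rule of comparing each layer only with the one before it, and the 0.95 threshold are all assumptions chosen for illustration, and the activations are synthetic.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_samples, n_features)."""
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


def redundant_layers(activations, threshold=0.95):
    """Flag layers whose output is nearly the same representation as the previous layer's.

    `activations` is a list of per-layer matrices collected in one forward pass.
    The adjacent-layer rule and the threshold are illustrative assumptions.
    """
    flagged = []
    for i in range(1, len(activations)):
        if linear_cka(activations[i - 1], activations[i]) >= threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Synthetic stand-ins for hidden states: layers 1 and 3 barely change their input.
    rng = np.random.default_rng(0)
    h0 = rng.normal(size=(256, 32))
    h1 = h0 + 0.01 * rng.normal(size=(256, 32))
    h2 = rng.normal(size=(256, 32))
    h3 = 2.0 * h2 + 0.01 * rng.normal(size=(256, 32))
    print(redundant_layers([h0, h1, h2, h3]))
```

A score near 1 means two layers encode almost the same information up to rotation and uniform scaling, which is why a layer that scores high against its predecessor is a candidate for removal. A real pipeline would collect hidden states from the model on a calibration batch rather than from random matrices.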
The practical implication is that robotics teams spending heavily on GPU time to fine-tune these models may be paying for redundancy, not capability. Smaller research groups and hardware-constrained deployments become more viable if the performance ceiling stays intact — and the paper claims it does, validated across three simulation benchmarks and 10 real-world tasks on four different robot platforms.
This fits a broader pattern in deep learning where massive pre-trained models turn out to be over-parameterized for downstream tasks, a dynamic that is well documented in large language models through pruning and distillation research. The robotics-specific wrinkle is that continuous control is notoriously sensitive to architectural changes, so matching full-model performance at half the depth is a stronger result than the same claim would be in a text classification setting.