
STORM Fixes a Blind Spot in Vision Mamba Token Pruning

A training-free framework called STORM improves top-1 accuracy by up to 63.3% over prior token reduction methods on vision Mamba models.

Compressing vision AI models is supposed to make them faster. A new paper shows that popular compression methods quietly destroy accuracy in one specific class of models, and it proposes a fix that requires no retraining.

Researchers identified a structural flaw in how existing token reduction methods handle Mamba-based vision models. Mamba is an architecture built for efficient long-sequence processing, and vision variants such as VMamba apply its selective scan across the image's two-dimensional grid of patch tokens. Standard token reduction methods ignore that layout entirely, stripping individual tokens in ways that shatter the grid topology the model depends on. The result is severe accuracy collapse on models like VMamba. The proposed fix, called STORM (Spatial-aware Token Reduction fraMework), reformulates token pruning as an operation on spatial units rather than individual tokens, preserving neighborhood structure throughout compression. Because it works as a plug-and-play module on existing pipelines, STORM requires no additional training.
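The difference between pruning individual tokens and pruning spatial units can be illustrated with a toy sketch. This is not the STORM algorithm, whose details are not given here. The choice of whole rows and columns as the "spatial unit", the hand-picked importance scores, and the helper names `prune_tokens`, `prune_units`, and `is_grid` are all assumptions made for the illustration.

```python
# Toy illustration only: NOT the STORM algorithm. The "spatial unit" used here
# (a whole row or column of the patch grid) is an assumption for this demo.

def prune_tokens(scores, keep):
    """Per-token pruning: keep the `keep` highest-scoring grid positions."""
    h, w = len(scores), len(scores[0])
    ranked = sorted(
        ((r, c) for r in range(h) for c in range(w)),
        key=lambda rc: scores[rc[0]][rc[1]],
        reverse=True,
    )
    return sorted(ranked[:keep])


def prune_units(scores, keep_rows, keep_cols):
    """Unit-level pruning: keep the rows and columns with the highest total score."""
    h, w = len(scores), len(scores[0])
    row_totals = [sum(scores[r]) for r in range(h)]
    col_totals = [sum(scores[r][c] for r in range(h)) for c in range(w)]
    rows = sorted(sorted(range(h), key=lambda r: row_totals[r], reverse=True)[:keep_rows])
    cols = sorted(sorted(range(w), key=lambda c: col_totals[c], reverse=True)[:keep_cols])
    return [(r, c) for r in rows for c in cols]


def is_grid(positions):
    """True if the kept positions form a complete rows-by-columns lattice."""
    rows = {r for r, _ in positions}
    cols = {c for _, c in positions}
    return set(positions) == {(r, c) for r in rows for c in cols}


# Hypothetical importance scores for a 4x4 grid of patch tokens.
scores = [
    [9, 1, 2, 3],
    [1, 8, 1, 4],
    [2, 1, 7, 0],
    [3, 2, 1, 6],
]

per_token = prune_tokens(scores, keep=4)                  # 4 of 16 tokens kept
per_unit = prune_units(scores, keep_rows=2, keep_cols=2)  # also 4 of 16 kept

print(per_token, is_grid(per_token))  # [(0, 0), (1, 1), (2, 2), (3, 3)] False
print(per_unit, is_grid(per_unit))    # [(0, 0), (0, 3), (1, 0), (1, 3)] True
```

Both strategies keep four of the sixteen tokens. The per-token survivors lie along a diagonal and cannot be arranged into a smaller rectangular grid, while the unit-level survivors form a clean 2x2 grid that a two-dimensional scan could still traverse.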

The gap this exposes matters beyond one paper. Mamba variants have attracted serious interest as an alternative to transformers for vision tasks, partly because they handle long sequences more efficiently. Token reduction is a standard technique for deployment on constrained hardware, and if it reliably breaks these models, that is a real barrier to shipping them. A training-free patch that restores much of the lost accuracy changes that calculus.

The VMamba result, an improvement of up to 63.3% in top-1 accuracy over prior reduction methods, is striking enough to invite scrutiny: benchmarks this clean often come with favorable experimental conditions. Whether STORM holds up across a wider range of tasks and hardware targets is the next question.


The Revision

Written by an AI system from the public sources credited above.