Reading AI Models Without Training a Decoder

Researchers say you can read what a neural network is thinking without teaching it anything new.

A paper posted on arXiv proposes "Bag of Dims," a method for extracting meaningful features from transformer models by examining the sign (positive or negative) of individual dimensions in hidden states. No rotation, no learned probes, no GPU-days of optimization. The researchers tested the approach across seven models covering language, vision, and audio, including Qwen, Gemma, Mistral, DINOv2, and an audio spectrogram transformer. Sign patterns alone preserved 60-93% top-5 next-token accuracy, and the method detected 175 semantic categories with AUC scores between 0.97 and 0.99 from a single forward pass.
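To make the core idea concrete, here is a minimal sketch of reducing a hidden state to its sign pattern. It is an illustration only, not the paper's implementation: the function names `sign_pattern` and `sign_agreement` are hypothetical, the hidden state is a random stand-in rather than a real model activation, and treating exact zeros as positive is an arbitrary choice made here.

```python
import numpy as np

def sign_pattern(hidden: np.ndarray) -> np.ndarray:
    """Reduce each dimension to +1 or -1, discarding magnitude.

    Exact zeros are mapped to +1; that tie-break is an assumption of this sketch.
    """
    return np.where(hidden >= 0, 1, -1).astype(np.int8)

def sign_agreement(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of dimensions on which two hidden states share a sign."""
    return float(np.mean(sign_pattern(a) == sign_pattern(b)))

# Random vector standing in for one token's hidden state (not real model output).
rng = np.random.default_rng(0)
hidden = rng.standard_normal(8)
pattern = sign_pattern(hidden)  # array of +1 / -1 values, one per dimension
```

Because the pattern uses one bit per dimension, comparing or cataloging states this way needs no training step, which is the property the article highlights.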

Mechanistic interpretability (the field trying to reverse-engineer what neural networks actually do internally) has mostly relied on sparse autoencoders and learned rotations that require significant compute and are specific to one model at a time. If sign patterns generalize across architectures and modalities as claimed here, that would shift the cost of interpretability work from expensive training runs to cataloging what each dimension encodes, which is considerably cheaper.

The authors also claim the features are causally operative: flipping a dimension's sign during inference suppresses the associated concept. That is a stronger claim than correlation, and it will need independent replication. But as a training-free baseline, Bag of Dims is a provocation worth taking seriously.
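The sign-flip intervention described above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the authors' code: `flip_dimension` is a hypothetical helper, the array is a toy stand-in for a batch of hidden states, and in a real experiment the flip would have to be applied inside the model's forward pass, which this sketch does not show.

```python
import numpy as np

def flip_dimension(hidden: np.ndarray, dim: int) -> np.ndarray:
    """Return a copy of `hidden` with the sign of one dimension negated.

    `hidden` may have any leading shape; the last axis indexes dimensions.
    """
    out = hidden.copy()
    out[..., dim] = -out[..., dim]
    return out

# Toy stand-in for hidden states: one token, three dimensions.
hidden = np.array([[1.0, -2.0, 3.0]])
flipped = flip_dimension(hidden, dim=1)  # only dimension 1 changes sign
```

Testing the causal claim would then mean comparing model outputs with and without the flip, which is the part that needs independent replication.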
