Transformer Blocks Are Not Equally Nonlinear

Not every layer of a transformer is doing the hard nonlinear work you might assume.

Researchers tested feed-forward network blocks across GPT-2, Pythia-160m, and Llama-160m, measuring how well a simple linear approximation could reconstruct each block's output. They called this metric "linear recoverability" and found that it varies wildly, not just across models but between adjacent layers within the same model. Some blocks scored above 0.99 (nearly linear); others fell below 0.3 (strongly nonlinear). Critically, GPT-2 and Pythia-160m share the same activation function and width, yet show sharply different recoverability profiles. That suggests the linearity of a block is a product of training rather than a fixed architectural property.
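
To make the metric concrete, here is a minimal sketch of one way such a score could be computed. It assumes linear recoverability is the R² of the best affine map from a block's inputs to its outputs, fit by closed-form least squares; the study's exact definition may differ. The function name `linear_recoverability` and the synthetic data are illustrative, not taken from the paper.

```python
import numpy as np

def linear_recoverability(X, Y):
    """R^2 of the best affine map X -> Y (assumed definition, fit in closed form)."""
    # Append a constant column so the fit includes a bias term.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    resid = Y - X1 @ W
    total = Y - Y.mean(axis=0, keepdims=True)
    return 1.0 - (resid ** 2).sum() / (total ** 2).sum()

# Synthetic stand-ins for block inputs and outputs (not real model activations).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
A = rng.normal(size=(16, 16))

# A nearly linear "block": linear map plus a little noise.
Y_linear = X @ A + 0.01 * rng.normal(size=(2000, 16))
# A strongly nonlinear "block": |x| is uncorrelated with x for symmetric inputs.
Y_nonlinear = np.abs(X) @ A

score_linear = linear_recoverability(X, Y_linear)
score_nonlinear = linear_recoverability(X, Y_nonlinear)
print(f"nearly linear block:      {score_linear:.3f}")
print(f"strongly nonlinear block: {score_nonlinear:.3f}")
```

This sketch scores the fit on the same data it was fit on; on real activations, a held-out split would give a more honest number.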

This matters for anyone trying to compress or prune large language models. The study shows that highly recoverable blocks can be replaced with far simpler single-layer approximations: GPT-2's early feed-forward block was compressed to one-eighth the parameters with a perplexity increase of only 0.77. Blocks with low recoverability, by contrast, resist this treatment, and so mark the places where aggressive compression will hurt. That is a more targeted signal than the blunt per-layer heuristics many pruning pipelines rely on.
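
The one-eighth figure is consistent with simple parameter arithmetic. The sketch below assumes the replacement is a single square linear layer on the model dimension and that the original block uses the usual 4x hidden expansion (768 to 3072 in GPT-2 small); the article does not spell out either detail, and the function names are illustrative.

```python
def ffn_params(d_model, expansion=4):
    """Parameters in a two-layer feed-forward block, including biases."""
    d_hidden = expansion * d_model
    up = d_model * d_hidden + d_hidden    # first projection + bias
    down = d_hidden * d_model + d_model   # second projection + bias
    return up + down

def linear_params(d_model):
    """Parameters in a single d_model x d_model linear layer with bias."""
    return d_model * d_model + d_model

d_model = 768  # GPT-2 small hidden size
ratio = linear_params(d_model) / ffn_params(d_model)
print(f"single linear layer / full block = {ratio:.4f}")
```

Ignoring biases, the block has 8·d² weights and the linear replacement has d², which is where the factor of eight comes from under these assumptions.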

The researchers also flag a quiet methodological trap: trained linear baselines often under-converge on transformer activations because the inputs are ill-conditioned, making past linearity estimates unreliable. The closed-form least-squares approach sidesteps that problem, which suggests some prior work on transformer internals may need revisiting.
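
The trap is easy to reproduce on toy data. In the sketch below, the target is exactly linear in the input, but the input dimensions differ in scale by a factor of 100. Plain full-batch gradient descent stands in for a "trained" linear baseline (an assumption; the article does not say which optimizers earlier work used), and it is compared with `np.linalg.lstsq`. The data, step count, and learning-rate rule are all illustrative.

```python
import numpy as np

def r2(Y, Y_hat):
    """Fraction of output variance explained by the prediction."""
    resid = ((Y - Y_hat) ** 2).sum()
    total = ((Y - Y.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

rng = np.random.default_rng(0)
n, d = 4000, 8
Z = rng.normal(size=(n, d))
scales = np.logspace(0, -2, d)   # per-dimension scales from 1 down to 0.01
X = Z * scales                   # ill-conditioned inputs
Y = Z.copy()                     # exactly linear in X: Y = X @ diag(1 / scales)

# Closed-form least squares recovers the map regardless of input scaling.
W_closed, *_ = np.linalg.lstsq(X, Y, rcond=None)
r2_closed = r2(Y, X @ W_closed)

# Gradient descent with the largest stable-looking step size, 1 / lambda_max.
cov = X.T @ X / n
lr = 1.0 / np.linalg.eigvalsh(cov).max()
W_gd = np.zeros((d, d))
for _ in range(200):
    grad = X.T @ (X @ W_gd - Y) / n
    W_gd -= lr * grad
r2_gd = r2(Y, X @ W_gd)

print(f"closed-form R^2:      {r2_closed:.4f}")
print(f"gradient-descent R^2: {r2_gd:.4f}")
```

The low-variance input directions converge very slowly under gradient descent, so the trained baseline reports a block as less linear than it is even though a perfect linear map exists.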
