A new paper reframes what drives grokking, the phenomenon where neural nets suddenly generalize long after they appear to have only memorized training data.
Researchers fixed weight norms using clamping, then varied output temperature independently. That let them slide the grokking delay across its full range without touching the norm itself. Matching the effective logit scale back to a baseline recovered about 85% of the delay across two moduli. When they mapped delay against a grid of norms and temperatures together, logit scale alone explained the variance with an R-squared of 0.97; the norm contributed only 1-2% on top. The effect is also loss-function-dependent: under mean-squared error, the logit scale stays fixed and the norm takes a different path, which tells you the cross-entropy result is not a universal law.
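The relationship between norm, temperature, and logit scale described above can be illustrated with a minimal sketch. It assumes a single linear readout whose logits grow in proportion to the clamped weight norm, so the effective logit scale is simply norm divided by temperature. That definition, the toy weights `W`, and the helper names `clamp_norm` and `predict` are illustrative assumptions, not the paper's code or setup.

```python
import math

def clamp_norm(rows, target_norm):
    """Rescale a weight matrix so its Frobenius norm equals target_norm."""
    norm = math.sqrt(sum(w * w for row in rows for w in row))
    return [[w * target_norm / norm for w in row] for row in rows]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predict(rows, x, target_norm, temperature):
    """Clamp the readout norm, compute logits, apply temperature softmax."""
    clamped = clamp_norm(rows, target_norm)
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in clamped]
    return softmax(logits, temperature)

# Hypothetical toy readout: 3 classes, 2 input features.
W = [[1.0, 0.5], [-0.5, 1.0], [0.25, -1.0]]
x = [1.0, 2.0]

# Same norm, different temperatures: the output saturates as norm / temperature grows.
soft = predict(W, x, target_norm=4.0, temperature=4.0)    # ratio 1
sharp = predict(W, x, target_norm=4.0, temperature=0.25)  # ratio 16

# Different norms, same ratio: the softmax outputs coincide.
a = predict(W, x, target_norm=2.0, temperature=1.0)
b = predict(W, x, target_norm=8.0, temperature=4.0)

print("max prob at ratio 1: ", max(soft))
print("max prob at ratio 16:", max(sharp))
print("same ratio, different norms:", a, b)
```

In this toy setting the norm matters only through the ratio, which is the kind of behavior the reported R-squared of 0.97 points to. A deeper network would not scale this simply, so the sketch is an intuition aid rather than a model of the experiments.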
The distinction matters because most grokking research treats weight norm as the causal lever and regularization as the prescription. If the norm is only an upstream handle on logit scale and the softmax saturation it produces, then interventions aimed at the norm may be targeting the wrong variable, and researchers tuning weight decay to speed generalization may be one step removed from what actually works.
The team also ran a float64 softmax-collapse audit and tested a no-LayerNorm transformer to close off alternative explanations; a forking-arms experiment confirmed the delay follows the held norm value, not the clamping operation itself, ruling out a rescaling artifact. All results reproduce from released code and data, a detail worth noting given that results in mechanistic interpretability papers often do not.
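For readers unfamiliar with why a float64 audit is a useful control, here is a hedged sketch of the general numerical issue, not the paper's audit procedure. Once the correct class leads by a large enough logit gap, its softmax probability can round to exactly 1 in float32, which makes the cross-entropy loss exactly zero. The gap value and helper names are hypothetical, and float32 is emulated by rounding a float64 result rather than by running the arithmetic in float32.

```python
import math
import struct

def to_float32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def correct_class_prob(logit_gap):
    """Two-class softmax probability of the class that leads by logit_gap."""
    return 1.0 / (1.0 + math.exp(-logit_gap))

def cross_entropy(prob):
    """Negative log-likelihood for a probability in (0, 1]; abs avoids printing -0.0."""
    return abs(math.log(prob))

gap = 20.0  # hypothetical logit gap, chosen to sit past float32 resolution

p64 = correct_class_prob(gap)  # computed in float64
p32 = to_float32(p64)          # the same value rounded to float32

print("float64 prob:", p64, "loss:", cross_entropy(p64))
print("float32 prob:", p32, "loss:", cross_entropy(p32))
```

Under these assumptions the float32 loss is exactly zero while the float64 loss stays small but positive, so repeating a run in float64 is one way to check whether a result depends on that rounding.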