On-policy distillation (OPD) trains large language models through a narrow, low-dimensional channel of weight space.
The arXiv paper compares parameter updates across three training regimes: supervised fine-tuning (SFT), reinforcement learning with verifiable rewards (RLVR), and OPD. Compared with SFT, OPD changes fewer parameters and steers clear of principal directions; compared with RLVR, its updates are less tightly constrained. Early in training, OPD's cumulative updates lock into a subspace that remains sufficient for its performance, whereas SFT collapses when confined to the same subspace. Sparsifying the update tokens or generating rollouts off-policy leaves the rank dynamics unchanged, but mixing OPD with RLVR disrupts them.
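To make the quantities in that comparison concrete, here is a minimal NumPy sketch of how one might measure them on a single weight matrix. It is an illustration only, not the paper's code or its exact metrics: the function names (`update_sparsity`, `principal_overlap`, `retained_fraction`), the tolerance, and the synthetic matrices are all assumptions made for this example.

```python
import numpy as np

def update_sparsity(delta, tol=1e-6):
    """Fraction of entries that the update leaves (almost) unchanged."""
    return float(np.mean(np.abs(delta) <= tol))

def principal_overlap(w0, delta, k):
    """Share of the update's squared Frobenius norm that lies in the
    top-k left and right singular subspaces of the starting weights w0."""
    u, _, vt = np.linalg.svd(w0, full_matrices=False)
    uk, vk = u[:, :k], vt[:k, :].T
    projected = uk @ (uk.T @ delta @ vk) @ vk.T
    return float(np.linalg.norm(projected) ** 2 / np.linalg.norm(delta) ** 2)

def retained_fraction(delta, basis):
    """Share of the update's squared norm kept after projecting its
    columns onto the span of an orthonormal basis (a 'locked' subspace)."""
    projected = basis @ (basis.T @ delta)
    return float(np.linalg.norm(projected) ** 2 / np.linalg.norm(delta) ** 2)

# Synthetic stand-ins (assumed for illustration, not real model weights).
rng = np.random.default_rng(0)
w0 = rng.standard_normal((64, 32))
u, s, vt = np.linalg.svd(w0, full_matrices=False)
k = 4

# An update built from the top-k principal directions of w0 ...
aligned = 0.01 * (u[:, :k] * s[:k]) @ vt[:k, :]
# ... and one built from the bottom-k directions, avoiding the principal ones.
off_principal = 0.01 * (u[:, -k:] * s[-k:]) @ vt[-k:, :]

# An update that touches only the first 8 of 64 rows.
sparse_update = np.zeros_like(w0)
sparse_update[:8, :] = 1.0

print(round(principal_overlap(w0, aligned, k), 3))
print(round(principal_overlap(w0, off_principal, k), 3))
print(update_sparsity(sparse_update))
print(round(retained_fraction(off_principal, u[:, -k:]), 3))
```

In this toy setup, an update that "steers clear of principal directions" shows a principal overlap near zero, and an update that stays inside a fixed subspace keeps nearly all of its norm after projection onto that subspace. Real measurements would be taken per layer on actual checkpoints, and the paper may define its metrics differently.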
The finding matters because it shows that OPD is not a midpoint between SFT and RLVR but has a distinct update geometry of its own, one that could be exploited for cheaper training or better control of model behaviour. If future work can harness this locked subspace, practitioners may achieve OPD-level reasoning without the full computational cost of standard pipelines.
For now, the results temper hype and suggest that OPD’s appeal lies in its unique update path, not just in borrowing tricks from other methods.