A research team has built StreamKL, a fused GPU kernel that makes attention distillation fast and memory-efficient enough to run on a single GPU at long context lengths.
Attention distillation trains one model's attention patterns to mimic another's by minimizing the gap between their probability distributions, a technique used in model compression, continual learning, and sparse-attention training. The problem: every existing method has to fully materialize both distributions before doing any math, and that materialization grows with the product of the query and key counts, which is quadratic in sequence length. At long contexts, the memory bill becomes unworkable. StreamKL sidesteps this by deriving an online formulation that streams small tiles of query-key data through on-chip SRAM instead. The backward pass recomputes attention probabilities tile by tile rather than storing the intermediate results. The net effect is that extra high-bandwidth memory (HBM) usage drops from O(N_Q N_K), where N_Q and N_K are the numbers of queries and keys, to O(1), a constant.
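The streaming idea can be illustrated on a CPU for a single query row. The sketch below is not the StreamKL kernel. It assumes the gap is measured as the KL divergence from teacher to student (as the name suggests, though the direction is an assumption), and the function names `streaming_kl` and `reference_kl` are hypothetical. It keeps only a few running scalars while walking over key tiles, using the identity KL(P||Q) = sum_j p_j (t_j - s_j) - lse(t) + lse(s), where t and s are the teacher and student logits.

```python
import math


def streaming_kl(teacher_logits, student_logits, tile=4):
    """KL(softmax(teacher) || softmax(student)) for one query row.

    Visits the keys one tile at a time and never stores a full
    probability vector; only five running scalars are kept.
    """
    m_t = m_s = -math.inf  # running maxima of teacher / student logits
    l_t = l_s = 0.0        # running sums of exp(logit - running max)
    w = 0.0                # running sum of exp(t - m_t) * (t - s)

    for start in range(0, len(teacher_logits), tile):
        t_tile = teacher_logits[start:start + tile]
        s_tile = student_logits[start:start + tile]

        new_m_t = max(m_t, max(t_tile))
        new_m_s = max(m_s, max(s_tile))

        # Rescale the accumulators to the new maxima.
        # On the first tile exp(-inf) is 0.0, which is harmless.
        scale_t = math.exp(m_t - new_m_t)
        scale_s = math.exp(m_s - new_m_s)
        l_t *= scale_t
        w *= scale_t
        l_s *= scale_s

        for t, s in zip(t_tile, s_tile):
            e = math.exp(t - new_m_t)
            l_t += e
            w += e * (t - s)
            l_s += math.exp(s - new_m_s)

        m_t, m_s = new_m_t, new_m_s

    lse_t = m_t + math.log(l_t)  # log-sum-exp of teacher logits
    lse_s = m_s + math.log(l_s)  # log-sum-exp of student logits
    return w / l_t - lse_t + lse_s


def reference_kl(teacher_logits, student_logits):
    """Same quantity, computed by materializing both distributions."""
    def softmax(x):
        m = max(x)
        e = [math.exp(v - m) for v in x]
        z = sum(e)
        return [v / z for v in e]

    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Both functions should agree to within floating-point error for any tile size. A backward pass in the same spirit could save only the two log-sum-exp values per row and rebuild each tile's probabilities from them, since the gradient of this loss with respect to the student logits is q - p.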
The practical ceiling for knowledge distillation has long been set by GPU memory, not compute. Reducing that footprint to a constant means researchers can distill long-context models without renting a cluster, which considerably lowers the barrier for academic labs and smaller teams. Reported speedups of up to 43x on the forward pass and 14x on the backward pass suggest the gains are not marginal.
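To put the quadratic footprint in numbers, here is a back-of-the-envelope calculation. It assumes one attention head, one sequence, 4-byte (fp32) entries, and two materialized distributions (teacher and student). These are illustrative assumptions rather than figures from the StreamKL work, and `materialized_bytes` is a hypothetical helper.

```python
def materialized_bytes(n_q, n_k, bytes_per_elem=4, num_distributions=2):
    """Bytes needed to hold full N_Q x N_K attention distributions."""
    return num_distributions * n_q * n_k * bytes_per_elem


GIB = 2 ** 30

for n in (8_192, 32_768, 131_072):
    gib = materialized_bytes(n, n) / GIB
    print(f"N_Q = N_K = {n:>7}: {gib:8.1f} GiB per head")
```

Under these assumptions, quadrupling the context length multiplies the footprint by sixteen, and at 131,072 tokens a single head already needs 128 GiB before any other training state is counted.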
StreamKL does not make distillation universally cheap: you still need a capable teacher model and the compute to run training. But it removes one of the more concrete hardware excuses for skipping the technique altogether.