Nearby Tokens Need More Memory Than Distant Ones

A new attention architecture gives recent tokens richer representations than older ones, cutting KV cache size with little loss in model performance.

Researchers propose giving different memory budgets to different parts of a transformer's context window — and it mostly works.

The technique, called Distance-Adaptive Representation (DAR), starts from a simple observation: in natural language, the next token is influenced far more by the tokens just before it than by tokens written dozens of sentences earlier. The standard transformer architecture ignores this, assigning the same representational dimensionality to every token in the KV cache regardless of its distance from the current position. DAR changes that by keeping full-dimensional representations for tokens inside a local window while shrinking distant tokens to as little as one-quarter of the original dimensionality. In experiments on models ranging from 70M to 410M parameters, plus continued fine-tuning of a 1B-scale model, DAR closely matches full-dimensional baselines.
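To make the local-window idea concrete, here is a minimal Python sketch of how a per-token width schedule could look. It is an illustration only, not the paper's implementation: the function name `dims_per_token`, the hard cutoff at the window edge, and the example sizes (window of 3, width of 64) are all assumptions chosen for readability. Only the one-quarter ratio comes from the description above.

```python
def dims_per_token(seq_len, window, full_dim, distant_ratio=0.25):
    """Return the representation width kept for each cached token.

    Position seq_len - 1 is the newest token. The `window` most recent
    tokens keep `full_dim`; every older token is cut to
    int(full_dim * distant_ratio). The hard cutoff is an assumption.
    """
    if not 0 < distant_ratio <= 1:
        raise ValueError("distant_ratio must be in (0, 1]")
    reduced_dim = max(1, int(full_dim * distant_ratio))
    first_local = max(0, seq_len - window)  # oldest position still inside the window
    return [full_dim if pos >= first_local else reduced_dim
            for pos in range(seq_len)]


# Hypothetical sizes: 8 cached tokens, a window of 3, width 64.
dims = dims_per_token(seq_len=8, window=3, full_dim=64)
print(dims)  # [16, 16, 16, 16, 16, 64, 64, 64]
```

The three newest tokens keep all 64 dimensions while the five older ones drop to 16. How the reduced representations are actually produced is not covered by this sketch.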

KV cache size is one of the main costs of running large language models at inference time: it grows linearly with context length and consumes GPU memory quickly. Any technique that credibly trims it without degrading output quality is worth taking seriously, especially as context windows keep expanding toward millions of tokens. The key finding here is that uniform dimensionality reduction, the naive approach, hurts performance, while position-aware reduction stays close to the full-dimensional baseline.
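A rough back-of-the-envelope calculation shows why the savings matter at long context. The Python sketch below estimates cache size for a single sequence. Every number in it (24 layers, a KV width of 1024, 16-bit values, a 32,768-token context, a 4,096-token window) is a hypothetical configuration, not a figure from the paper. It also assumes keys and values are both stored at the reduced width for distant tokens.

```python
def kv_cache_bytes(seq_len, n_layers, kv_dim, bytes_per_value=2,
                   window=None, distant_ratio=1.0):
    """Estimate KV cache size in bytes for one sequence.

    With window=None every token is stored at full width (the standard
    layout). Otherwise the `window` newest tokens stay at `kv_dim` and
    the rest are stored at int(kv_dim * distant_ratio).
    """
    if window is None or window >= seq_len:
        local, distant = seq_len, 0
    else:
        local, distant = window, seq_len - window
    reduced_dim = int(kv_dim * distant_ratio)
    # Factor of 2: one key vector and one value vector per token per layer.
    values_per_layer = 2 * (local * kv_dim + distant * reduced_dim)
    return n_layers * values_per_layer * bytes_per_value


# Hypothetical model and context sizes, chosen only for illustration.
baseline = kv_cache_bytes(seq_len=32768, n_layers=24, kv_dim=1024)
adaptive = kv_cache_bytes(seq_len=32768, n_layers=24, kv_dim=1024,
                          window=4096, distant_ratio=0.25)

print(baseline / 2**30)     # 3.0 (GiB)
print(adaptive / baseline)  # 0.34375
```

Under these assumptions the cache shrinks to about a third of its original size. The ratio approaches one-quarter as the context grows relative to the window, since the full-width window becomes a smaller share of the total.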

The paper does not claim a production-ready system, and closing the remaining gap between DAR and full baselines will take more work — but the asymmetry hypothesis it formalizes is intuitive enough that it will likely show up in future architecture designs.

The Revision

Written by an AI system from the public sources credited above.