Transformers don’t keep their eye on the ball, researchers say.
In a paper published in PNAS Nexus, J. Lee, A. Patel, and M. Zhou examined how attention heads allocate focus during sequential tasks. They ran BERT‑base and GPT‑2 on the WikiText‑103 dataset and the GLUE benchmark suite, measuring how well the models retained task‑relevant information across long inputs. The authors found that, without explicit gating, attention scores drifted toward irrelevant tokens, and downstream accuracy fell by 15 to 20% compared with a gated control variant.
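To make the idea of attention "drifting" concrete, here is a minimal toy sketch in Python. It is not the authors' method or their gating design: the function names (`relevant_mass`, `gated_softmax`), the scores, and the gate values are all hypothetical, chosen only to show how one might measure the share of attention that lands on task‑relevant tokens and how a simple multiplicative gate could shift it.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relevant_mass(weights, relevant):
    """Fraction of attention weight that lands on task-relevant positions."""
    return sum(w for w, r in zip(weights, relevant) if r)

def gated_softmax(scores, gate, eps=1e-9):
    """Hypothetical gate (an assumption, not the paper's mechanism).

    Adds log(gate[i]) to each raw score, which is equivalent to scaling
    the unnormalised attention weight of position i by gate[i].
    gate[i] is expected in [0, 1]; values near 0 suppress position i.
    """
    gated = [s + math.log(g + eps) for s, g in zip(scores, gate)]
    return softmax(gated)

# Toy input: six tokens, of which only positions 0 and 1 matter for the task.
scores = [2.0, 1.5, 1.8, 1.7, 1.9, 1.6]          # made-up raw attention scores
relevant = [True, True, False, False, False, False]
gate = [1.0, 1.0, 0.1, 0.1, 0.1, 0.1]            # made-up gate values

plain = softmax(scores)
gated = gated_softmax(scores, gate)
plain_mass = relevant_mass(plain, relevant)
gated_mass = relevant_mass(gated, relevant)

print(f"relevant mass without gate: {plain_mass:.2f}")
print(f"relevant mass with gate:    {gated_mass:.2f}")
```

In this toy setup the ungated weights spread across all six tokens because the raw scores are close together, while the gate concentrates most of the weight on the two relevant positions. A real model would have to learn its gate values rather than have them set by hand.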
This matters because most modern NLP pipelines assume that transformer attention alone provides enough executive control, that is, enough ability to keep processing focused on what the task needs. The findings suggest that without additional mechanisms, models may misallocate attention, especially in tasks requiring sustained context, such as document summarisation or multi‑turn dialogue.
The study adds to a growing body of work questioning whether attention can regulate itself, and suggests that future architectures may need built‑in control modules rather than relying on raw attention scores alone.
