Transformers keep losing to basic linear models on time-series tasks — and researchers think they finally know why.
The culprit, according to a new paper, is softmax attention's built-in constraint: it can only blend inputs using positive weights that sum to one. That's a convex combination, and convex combinations can't represent the oscillatory, signed patterns — filtering, harmonic structure — that show up throughout temporal data. The researchers call this the "simplex-constrained mixing bottleneck." Their proposed fix, Temporal Operator Attention (TOA), grafts learnable sequence-space operators onto standard attention layers, adding the negative-weight mixing that time-series signals actually require. A companion technique called Stochastic Operator Regularization applies high-variance dropout to stop the operators from simply memorizing training sequences.
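To make the constraint concrete, here is a minimal NumPy sketch of why convex mixing falls short. It is an illustration, not the paper's code: the helper `softmax`, the first-difference matrix `diff_op`, and the toy sine signal are all assumptions chosen for the example. The article does not specify what TOA's learned operators look like.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: every row is positive and sums to one."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T = 64
x = np.sin(np.linspace(0.0, 4.0 * np.pi, T))  # toy oscillating signal

# Whatever the attention scores are, softmax turns each row into a convex combination.
weights = softmax(rng.normal(size=(T, T)))
mixed = weights @ x  # each output is a weighted average, so it stays in [x.min(), x.max()]

# A first-difference filter (hypothetical stand-in for a signed operator):
# row i computes x[i] - x[i-1], so it needs a negative weight and rows that sum to zero.
diff_op = np.eye(T) - np.eye(T, k=-1)
diff_op[0, 0] = 0.0  # the first step has no predecessor
diffed = diff_op @ x

# A constant signal separates the two: convex mixing returns the constant unchanged,
# while the difference filter maps it to zero. No softmax matrix can do the latter.
const = np.ones(T)
mixed_const = weights @ const
diffed_const = diff_op @ const
```

Because every softmax row sums to one, a constant input always passes through unchanged, whereas a differencing filter must send it to zero. That is one simple way to see that signed, zero-sum filters lie outside what a single softmax mixing step can express.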
The stakes are higher than the jargon suggests. The gap between Transformers and simple linear models on time-series benchmarks has been a quiet embarrassment in the ML community for years: prior work showed a single-layer linear model routinely outperforming heavily tuned Transformers on standard forecasting tests. TOA doesn't abandon the architecture; it argues that softmax attention was the specific bottleneck, a more precise diagnosis than most proposed fixes offer. The paper reports gains across forecasting, anomaly detection, and classification tasks when TOA is integrated into existing models like PatchTST and iTransformer, with the largest improvements on reconstruction-heavy tasks.
Showing consistent benchmark improvements is not the same as closing the gap that linear models opened — but grounding the failure in a mathematical constraint rather than vague architectural limits is, at minimum, a more honest starting point.