One Architecture to Rule CNNs, Transformers, and RNNs

A new preprint claims the three dominant neural network families are all special cases of the same math.

The paper, posted to the arXiv preprint server with no institutional affiliation named in the source, introduces the Integral Transform Network (ITNet). The core idea: convolutions, self-attention (including multi-head), and autoregressive recurrence (covering LSTMs, GRUs, S4, and Mamba) are not fundamentally distinct operations but different parameterizations of a single learnable kernel. That kernel is implemented as a small MLP that models pairwise interactions between positions and features. The authors also claim ITNet is a universal approximator of continuous operators: a theoretical guarantee that, in principle, the architecture can approximate any continuous mapping between function spaces to arbitrary accuracy.
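To make the unification claim concrete, here is a minimal sketch in plain Python of how one "kernel" interface can express both a convolution and self-attention. It is an illustration of the general idea only, not the paper's implementation: the function names, the scalar-valued sequence, and the inputs fed to the toy MLP kernel are all assumptions made for this example.

```python
import math

def integral_transform(x, kernel):
    """Discrete analogue of y(i) = sum_j k(i, j) * x(j) for a scalar sequence x."""
    n = len(x)
    return [sum(kernel(i, j) * x[j] for j in range(n)) for i in range(n)]

def conv_kernel(weights):
    """Kernel that depends only on the offset j - i.

    This is a zero-padded 1D cross-correlation, which is what deep learning
    libraries usually call a convolution.
    """
    radius = len(weights) // 2
    def k(i, j):
        d = j - i
        return weights[d + radius] if abs(d) <= radius else 0.0
    return k

def attention_kernel(queries, keys):
    """Kernel that depends on content: row-wise softmax of scaled dot products."""
    dim = len(queries[0])
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(dim) for key in keys]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return lambda i, j: rows[i][j]

def mlp_kernel(w1, b1, w2):
    """Hypothetical one-hidden-layer MLP scoring a pair of positions.

    The input features (i, j, j - i) are an assumption for illustration;
    the article does not say what the paper's kernel MLP receives.
    """
    def k(i, j):
        feats = [float(i), float(j), float(j - i)]
        hidden = [math.tanh(sum(f * w for f, w in zip(feats, col)) + b)
                  for col, b in zip(w1, b1)]
        return sum(h * w for h, w in zip(hidden, w2))
    return k

x = [1.0, 2.0, 3.0]
print(integral_transform(x, conv_kernel([0.0, 1.0, 0.5])))  # [2.0, 3.5, 3.0]
```

The same `integral_transform` call accepts any of the three kernels. What changes between "architectures" in this toy picture is only what the kernel is allowed to depend on: the offset between positions, the content at those positions, or whatever a learned MLP makes of its inputs.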

If the results hold up, the practical implications are significant. A single architecture that matches or exceeds specialized baselines across vision (ImageNet-1K), language (GLUE), 3D point clouds (ModelNet40), and visual question answering (VQA v2, NLVR2) would reduce the pressure to pick the right inductive bias upfront, and could simplify multi-modal pipelines that currently stitch together separate model families. The efficiency story matters too: the team developed tiled kernel fusion and importance-weighted Monte Carlo integration specifically to keep the approach practical at scale.
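For readers unfamiliar with the term, importance-weighted Monte Carlo integration replaces a full sum over all positions with an average over a few sampled ones, reweighted so the estimate stays unbiased. The sketch below is textbook importance sampling in plain Python, not the paper's scheme, which the article does not describe; the function name and the example numbers are assumptions.

```python
import random

def importance_sampled_sum(terms, proposal, num_samples, rng):
    """Unbiased estimate of sum(terms) using indices drawn from `proposal`.

    `proposal` must be a probability distribution over indices that is
    nonzero wherever `terms` is nonzero. Each sampled term is divided by
    its sampling probability, which corrects for the non-uniform draw.
    """
    indices = rng.choices(range(len(terms)), weights=proposal, k=num_samples)
    return sum(terms[j] / proposal[j] for j in indices) / num_samples

# Hypothetical per-position contributions k(i, j) * x[j] for one output i.
terms = [0.1, 0.2, 0.7, 4.0]
exact = sum(terms)

rng = random.Random(0)
uniform = [1.0 / len(terms)] * len(terms)
matched = [t / exact for t in terms]  # proposal proportional to the terms

print(importance_sampled_sum(terms, uniform, 20000, rng))
print(importance_sampled_sum(terms, matched, 10, rng))
```

When the proposal is proportional to the terms themselves, every sample yields the exact total and the variance drops to zero. That ideal proposal requires already knowing the answer, so practical schemes rely on a cheap approximation of it; how ITNet chooses its proposal is not stated in the article.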

The obvious caveat is that this is an unreviewed preprint, and grand unification claims in deep learning have a mixed track record. The benchmark results are promising, but the architecture research community will want to see this stress-tested on tasks where the specialized models were purpose-built to excel.
