Smarter Token Merging Cuts Compute Costs for Image AI

A research team has a new way to make image-generating AI cheaper to run without wrecking output quality.

Latent diffusion models, the engine behind most high-quality AI image generators, convert images into compressed numerical tokens before processing. The problem: existing systems use a fixed compression ratio, so every image costs roughly the same compute regardless of complexity. Variable-length tokenizers tried to fix this by trimming token sequences, but trimming changes what each token means depending on its position, creating mismatches the model can't reconcile. The new approach, called learnable global merging, combines similar tokens instead of cutting them, preserving representational alignment across different compression levels.
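To make the trimming-versus-merging distinction concrete, here is a minimal NumPy sketch. It is an illustration, not the researchers' implementation: the function names, the shapes, and the choice of a softmax-normalized weight matrix as the merge operator are all assumptions, and random numbers stand in for weights that would normally be learned in training.

```python
import numpy as np

def trim_tokens(tokens, k):
    """Trimming: keep the first k tokens and discard the rest."""
    return tokens[:k]

def merge_tokens(tokens, merge_logits):
    """Merging (hypothetical form): each of the k output tokens is a
    weighted average of all n input tokens.

    merge_logits has shape (k, n). A softmax over each row turns the
    logits into averaging weights that sum to 1.
    """
    shifted = merge_logits - merge_logits.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

rng = np.random.default_rng(0)
n, d, k = 16, 8, 4                      # 16 tokens of width 8, compressed to 4
tokens = rng.normal(size=(n, d))
merge_logits = rng.normal(size=(k, n))  # stand-in for trained parameters

trimmed = trim_tokens(tokens, k)
merged = merge_tokens(tokens, merge_logits)

print(trimmed.shape, merged.shape)
```

Both paths return the same number of tokens, but trimming throws away everything past the cutoff, while every input token contributes something to the merged output. The merge weights in this sketch do not depend on the image being processed, so the same matrix applies to every input.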

The distinction matters because it means a single model can serve multiple quality-compute tiers without retraining at each setting — a practical advantage for anyone running inference at scale. On the ImageNet 256x256 benchmark, the merging-based tokenizer outperforms prior variable-length methods on the standard quality-versus-compute trade-off metric.

Diffusion model efficiency has become a quiet arms race: teams at major labs and startups are hunting for any technique that cuts inference cost without visibly degrading output. Token merging is not a new idea in vision transformers, but applying it as a learnable, data-independent global operation inside a diffusion pipeline is a meaningful wrinkle. The code is public on GitHub, so the real test is whether practitioners find it easy enough to adopt outside a controlled benchmark.
