- New version of representation autoencoders (RAEv2) slashes convergence time and improves image quality.
The authors replace the vanilla VAE encoder with a sum of the last k layers from a pretrained vision model. This change alone improves reconstruction quality without any finetuning. They also show that using the same pretrained representation for both the encoder and the intermediate diffusion layers (the technique prior work called REPA) adds a complementary signal. Finally, they repurpose REPA as a built-in guidance method, eliminating the need for a second diffusion model.
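To make the "sum of the last k layers" idea concrete, here is a minimal sketch. It assumes the pretrained model exposes one feature array per layer, all with the same shape; the function name `sum_last_k_layers` and the toy arrays are hypothetical, and the summary does not say whether the authors normalize or weight the layers before summing.

```python
import numpy as np

def sum_last_k_layers(hidden_states, k):
    """Sum the final k per-layer feature maps into a single latent.

    hidden_states: sequence of arrays, one per layer, each shaped (tokens, dim).
    This is an illustrative assumption, not the authors' implementation.
    """
    if not 1 <= k <= len(hidden_states):
        raise ValueError("k must be between 1 and the number of layers")
    # Stack the last k layers along a new leading axis, then sum over it.
    return np.sum(np.stack(hidden_states[-k:]), axis=0)

# Toy stand-in for a frozen encoder's outputs: 4 layers, 3 tokens, 2 dims.
# Layer i is filled with the constant i so the sum is easy to check by eye.
layers = [np.full((3, 2), float(i)) for i in range(1, 5)]
latent = sum_last_k_layers(layers, k=2)  # combines layers 3 and 4
```

Because the encoder stays frozen, a step like this adds no trainable parameters, which fits the claim that the gain comes without finetuning.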
Why it matters: training diffusion models has been a numbers game, with state-of-the-art results requiring thousands of GPU hours. RAEv2 reaches a gFID of 1.06 on ImageNet-256 after only 80 epochs, a ten-fold speedup in convergence over the original RAE. On the FDr6 benchmark it scores 2.17 at the same epoch count, beating the previous best of 3.26, and it does so without any post-training tricks. The authors also propose EPFID@k, the number of epochs needed to reach a target gFID, as a more practical efficiency metric.
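An epochs-to-target metric of this kind could be computed roughly as follows. This is a sketch under stated assumptions: gFID is only measured at discrete checkpoints, no interpolation is done between them, and the function name `epochs_to_target` and the curve values are made up for illustration (they are not results from the paper).

```python
def epochs_to_target(curve, target_gfid):
    """Return the first evaluated epoch whose gFID is at or below the target.

    curve: iterable of (epoch, gfid) pairs from periodic evaluation.
    Returns None if the target is never reached.
    """
    for epoch, gfid in sorted(curve):
        if gfid <= target_gfid:
            return epoch
    return None

# Hypothetical training curve, for illustration only.
curve = [(10, 6.0), (20, 3.5), (40, 1.8), (80, 1.2)]

reached = epochs_to_target(curve, target_gfid=2.0)       # first checkpoint at or below 2.0
never_reached = epochs_to_target(curve, target_gfid=1.0)  # target not hit in this run
```

Reporting the epoch count rather than the final score makes runs with different training budgets directly comparable, although the result depends on how often checkpoints are evaluated.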
The result is a faster, simpler pipeline that could make high‑quality diffusion more accessible, especially for groups without massive compute budgets. Whether this approach scales to larger models or more exotic modalities remains to be seen.