Researchers say they have found a way to train massive language models across geographically separated data centers without keeping a full copy of the model at every site.
Pre-training large language models typically requires tightly coupled, high-speed hardware inside a single data center. Mixture-of-Experts architectures eased the compute burden by decoupling parameter count from active computation, but they still hit a wall: existing distributed approaches like DiLoCo and Photon need a complete model replica at each node, which turns memory and bandwidth into hard limits. FoMoE breaks that constraint by partitioning expert layers across workers rather than duplicating them everywhere. The paper reports up to a 1.42x reduction in communication costs over those efficient baselines, and up to 45.44x over standard distributed data-parallel (DDP) training. The second figure clears a much lower bar, since DDP does not apply any of the same optimizations.
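To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how many parameters each worker would hold when experts are replicated versus partitioned. The function name and every number in it are hypothetical and chosen only for illustration. They are not taken from the paper, and the even split of experts across workers is an assumption rather than a description of FoMoE's actual placement scheme.

```python
def params_per_worker(shared_params, params_per_expert, num_experts,
                      num_workers, partition_experts):
    """Estimate the parameters one worker must hold (illustrative only).

    shared_params: non-expert parameters (attention, embeddings, routers),
        assumed here to be replicated on every worker.
    partition_experts: if True, experts are split evenly across workers;
        if False, every worker keeps a full copy of all experts.
    """
    if partition_experts:
        # Ceiling division: each worker holds at most this many experts.
        experts_here = -(-num_experts // num_workers)
    else:
        experts_here = num_experts
    return shared_params + experts_here * params_per_expert


# Hypothetical configuration, not from the paper.
SHARED = 1_000_000_000       # 1B shared parameters
PER_EXPERT = 100_000_000     # 100M parameters per expert
NUM_EXPERTS = 64
NUM_WORKERS = 8

replicated = params_per_worker(SHARED, PER_EXPERT, NUM_EXPERTS, NUM_WORKERS, False)
partitioned = params_per_worker(SHARED, PER_EXPERT, NUM_EXPERTS, NUM_WORKERS, True)

print(f"replicated:  {replicated / 1e9:.1f}B parameters per worker")
print(f"partitioned: {partitioned / 1e9:.1f}B parameters per worker")
```

Under these made-up numbers, a worker holding every expert stores 7.4B parameters, while a worker holding one-eighth of the experts stores 1.8B. The shared layers set a floor on the savings, so the benefit grows as experts make up a larger share of the model.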
The practical stakes are high: whoever can train frontier models on loosely connected, commodity infrastructure gains a large cost advantage over labs that depend on dense GPU clusters. FoMoE also claims up to 1.4x throughput gains via a skip-token mechanism, and the authors use system modeling to project the memory and communication benefits to 100-billion-parameter scale. Those larger-scale numbers are projections, not measured results.
FoMoE follows a line of work, starting with DiLoCo from Google DeepMind and followed shortly by Photon, that treats geographic distribution as a first-class training constraint rather than an afterthought. If the 100B projections hold up in practice, the gap between what a well-funded lab and a well-organized coalition of smaller operators can train may narrow considerably.