Training large AI models across many machines could soon get cheaper on the networking bill.
Researchers have proposed LoRDO, a framework that combines low-rank optimization with infrequent synchronization to reduce how much data workers must exchange during distributed training. Standard distributed training, in which gradient updates are synced across workers after every step, is bottlenecked by interconnect bandwidth at scale. Prior workarounds either cut sync frequency (which remains memory-hungry) or used low-rank optimizers (which confine the model to a narrow slice of parameter space). LoRDO pairs the two and adds a full-rank quasi-hyperbolic update that lets the optimizer escape that subspace trap. In tests on language models from 125M to 720M parameters, communication fell by roughly 10x while accuracy stayed close to that of standard distributed training.
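To make the "subspace trap" concrete, here is a toy sketch in plain Python. It is not the LoRDO algorithm, and the paper's exact formulation may differ. It assumes a fixed orthonormal basis for the low-rank subspace and borrows the standard quasi-hyperbolic momentum mix, `nu * momentum + (1 - nu) * gradient`. All function names and the values of `beta` and `nu` are hypothetical, chosen only for illustration, and the infrequent-synchronization half of the method is left out entirely.

```python
# Toy illustration only: NOT the LoRDO algorithm from the paper.
# All names and hyperparameter values here are hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(basis, vec):
    """Coordinates of vec in the subspace spanned by the orthonormal rows of basis."""
    return [dot(row, vec) for row in basis]

def lift(basis, coeffs):
    """Map subspace coordinates back to the full parameter space."""
    dim = len(basis[0])
    return [sum(c * row[i] for c, row in zip(coeffs, basis)) for i in range(dim)]

def low_rank_update(basis, momentum, grad, beta=0.9):
    """Momentum lives in the small subspace, so the update cannot leave it."""
    g_low = project(basis, grad)
    momentum = [beta * m + (1 - beta) * g for m, g in zip(momentum, g_low)]
    return lift(basis, momentum), momentum

def quasi_hyperbolic_update(basis, momentum, grad, beta=0.9, nu=0.7):
    """Blend the low-rank momentum with the raw full-rank gradient (assumed QHM-style mix)."""
    low, momentum = low_rank_update(basis, momentum, grad, beta)
    update = [nu * l + (1 - nu) * g for l, g in zip(low, grad)]
    return update, momentum

def off_subspace_norm(basis, vec):
    """Length of the part of vec that lies outside the subspace."""
    inside = lift(basis, project(basis, vec))
    residual = [v - s for v, s in zip(vec, inside)]
    return dot(residual, residual) ** 0.5

basis = [[1.0, 0.0, 0.0]]   # rank-1 subspace of a 3-D parameter space
grad = [0.5, -2.0, 1.0]     # gradient with components outside the subspace

low, _ = low_rank_update(basis, [0.0], grad)
qh, _ = quasi_hyperbolic_update(basis, [0.0], grad)

print("low-rank update, off-subspace norm:", off_subspace_norm(basis, low))
print("quasi-hyperbolic update, off-subspace norm:", off_subspace_norm(basis, qh))
```

In this toy setup the purely low-rank update has no component outside the chosen subspace, while the blended update does, which is the intuition behind letting a full-rank term pull the optimizer out of a narrow slice of parameter space.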
Bandwidth is increasingly the hidden tax on AI infrastructure. As model sizes grow and clusters sprawl across data centers, the cost of keeping workers in sync can rival compute costs, which makes any credible reduction in communication overhead worth scrutinizing. LoRDO's gains look especially useful in memory-constrained settings, where the small ranks and batch sizes that typically hurt performance appear to work in its favor.
The results are promising at sub-billion parameter scales, but the real test will be whether the approach holds up at the 7B-to-70B range where most production training actually happens.