RaBiT shows that 2‑bit large language models can run 4.49× faster than full‑precision baselines on an RTX 4090.
The authors identify a training failure they call inter‑path adaptation, where parallel binary residual paths learn the same features and waste capacity. Their solution forces each binary path to stem from a shared full‑precision weight, creating a strict error‑correction hierarchy. A specially designed initialization keeps early layers functional, preventing collapse. Benchmarks report state‑of‑the‑art accuracy for 2‑bit models, closing the gap to heavyweight vector‑quantization approaches.
If accurate low‑bit inference is to become practical, the bottleneck has often been wasted bits rather than raw compute. By eliminating redundant paths, RaBiT restores expressive power without extra hardware, making extreme quantization a viable deployment option for edge servers and desktop GPUs.
The result is a reminder that clever training tricks can sometimes outpace raw silicon upgrades.