A Better 4-bit Recipe for Training Giant AI Models

Researchers propose UFP4, a uniform 4-bit training format that avoids a systematic rounding flaw baked into the FP4 standard most GPU makers are shipping.

A new pretraining recipe challenges the 4-bit floating-point format that NVIDIA and AMD have built into their latest AI accelerators.

Researchers publishing on arXiv identified a flaw they call Shrinkage Bias in E2M1, the 4-bit format built into NVIDIA Blackwell and Rubin-class hardware as well as AMD's MI350 GPUs. The bias stems from geometric asymmetry in how E2M1 represents numbers: rounding errors consistently skew negative, and those errors compound across every layer of a neural network. The paper also shows that the Random Hadamard Transform, a technique commonly used to improve quantization quality, amplifies this bias rather than canceling it, which would explain instability that practitioners have observed but not fully diagnosed.
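To make the uneven spacing concrete, here is a minimal Python sketch that decodes every non-negative E2M1 code and prints the gaps between neighboring values. It assumes the usual FP4 layout of 1 sign bit, 2 exponent bits, and 1 mantissa bit with an exponent bias of 1; the function name `decode_e2m1` is illustrative and not taken from the paper. It only shows the shape of the grid, not the paper's bias analysis.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code.

    Assumed layout: [sign | exp1 exp0 | mantissa], exponent bias 1.
    """
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal range: no implicit leading 1, so the value is 0 or 0.5.
        return sign * man * 0.5
    # Normal range: implicit leading 1 plus a half-step from the mantissa bit.
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# The eight non-negative codes (sign bit = 0), in increasing order.
magnitudes = [decode_e2m1(c) for c in range(8)]
# Distance between each pair of neighboring representable values.
gaps = [b - a for a, b in zip(magnitudes, magnitudes[1:])]

print(magnitudes)
print(gaps)
```

The gaps widen as values grow, so numbers near the top of the range are rounded onto a much coarser grid than numbers near zero.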

The alternative the researchers propose, UFP4, swaps E2M1 for a uniform 4-bit grid (E1M2 or INT4-style) that sidesteps the geometric problem entirely. Tested on models ranging from a dense 1.5-billion-parameter network to a mixture-of-experts model at 124 billion parameters, UFP4 consistently produced smaller loss degradation relative to full-precision BF16 training than comparable E2M1 baselines. That matters because 4-bit training is one of the few remaining levers for cutting memory and compute costs without shrinking the model itself.
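The difference between the two grids can be explored with a small Python sketch. This is not the paper's method: the per-tensor absmax scaling, the symmetric -7..7 stand-in for an "INT4-style" grid, and the helper names `quantize` and `mean_magnitude_error` are all assumptions made for illustration. The second helper reports the average change in magnitude after rounding, a rough way to see whether a grid tends to shrink values.

```python
import random

# Representable magnitudes; the sign is handled separately.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Assumed uniform stand-in: evenly spaced magnitudes 0..7.
UNIFORM_GRID = [float(i) for i in range(8)]


def quantize(values, grid):
    """Round-to-nearest onto a signed grid after absmax scaling.

    Returns dequantized floats. Ties go to the smaller grid magnitude.
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / grid[-1]  # map the largest input onto the largest grid value
    out = []
    for v in values:
        target = abs(v) / scale
        mag = min(grid, key=lambda g: abs(target - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def mean_magnitude_error(values, grid):
    """Average of |quantized| - |original|; negative means values shrank."""
    q = quantize(values, grid)
    return sum(abs(b) - abs(a) for a, b in zip(values, q)) / len(values)


if __name__ == "__main__":
    random.seed(0)
    sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]
    print("E2M1   :", mean_magnitude_error(sample, E2M1_GRID))
    print("uniform:", mean_magnitude_error(sample, UNIFORM_GRID))
```

Results from a toy like this depend heavily on the input distribution and the scaling rule, so it should be read as a way to poke at the idea rather than as a reproduction of the paper's findings.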

NVIDIA and AMD have already committed silicon to E2M1 as a first-class format, so any course correction runs into hardware that is already shipping. The paper's implicit ask, that future accelerators treat uniform 4-bit grids as equals to E2M1, is a reasonable one, but it arrives after the molds were cast.

The Revision

Written by an AI system from the public sources credited above.