Fast-dLLM++ speeds up diffusion LLM inference by up to 37%

Fast-dLLM++ squeezes extra speed out of diffusion large language models.

Fast-dLLM++ replaces the original Fast-dLLM decoder with a "Fréchet profile decoding" rule. Instead of basing parallel token commits on the single worst‑case confidence, it scans the whole sorted confidence profile and picks a set that respects heterogeneous confidence levels. The change needs no model retraining, no alteration to the diffusion process, and no cache redesign – it drops in where Fast-dLLM already runs. Benchmarks on GSM8K, MATH, HumanEval and MBPP using the LLaDA‑8B model show up to 37% higher throughput at comparable accuracy.
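To make the contrast concrete, here is a minimal Python sketch of the two kinds of commit rule. It is an illustration under stated assumptions, not the Fast-dLLM++ implementation: the function names, the `budget` parameter, and the exact acceptance tests are hypothetical. The only borrowed fact is the Fréchet lower bound, which says the probability that all k committed tokens are correct is at least 1 minus the sum of their individual error probabilities.

```python
# Hypothetical sketch: neither function is taken from Fast-dLLM or Fast-dLLM++.
# `confidences` holds the model's per-position top-token probabilities
# for the positions that are still masked in the current step.

def worst_case_commit(confidences, budget=0.1):
    """Count tokens to commit when every token in the set is charged
    the error of the least confident one (assumed conservative rule)."""
    ranked = sorted(confidences, reverse=True)
    best = min(1, len(ranked))  # always commit the top token if any exist
    for k in range(1, len(ranked) + 1):
        # k tokens, each assumed as uncertain as the k-th (worst) one.
        if k * (1.0 - ranked[k - 1]) <= budget:
            best = max(best, k)
        else:
            break  # the charge only grows with k, so stop early
    return best


def profile_commit(confidences, budget=0.1):
    """Count tokens to commit when each token is charged its own error,
    using the Fréchet bound P(all correct) >= 1 - sum(1 - c_i)."""
    ranked = sorted(confidences, reverse=True)
    best = min(1, len(ranked))
    total_error = 0.0
    for k, c in enumerate(ranked, start=1):
        total_error += 1.0 - c  # accumulate the actual profile
        if total_error <= budget:
            best = max(best, k)
        else:
            break
    return best


if __name__ == "__main__":
    step_confidences = [0.99, 0.99, 0.98, 0.96, 0.90, 0.60]
    print("worst-case rule commits:", worst_case_commit(step_confidences))
    print("profile rule commits:   ", profile_commit(step_confidences))
```

Because the summed error of the top k tokens never exceeds k times the error of the weakest one, the profile rule in this sketch always commits at least as many tokens as the worst-case rule for the same budget. That is the sense in which tokens "blocked by a conservative rule" can be recovered.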

The gain matters because diffusion LLMs have long been stuck on a serial bottleneck despite their parallel generation promise. By harvesting safely parallelizable tokens that were previously blocked by a conservative rule, practitioners can run larger workloads on existing hardware.

In short, a throughput gain of up to 37% narrows the gap between diffusion LLM theory and practice. Because the method drops into existing Fast-dLLM setups without retraining, adoption may come quickly, while researchers continue to probe further confidence‑aware decoding tricks.
