
Fast-dLLM++ speeds up diffusion LLM inference by up to 37%

A training-free decoding tweak exploits confidence heterogeneity to boost diffusion LLM throughput by up to 37% without hurting accuracy.

  • Fast-dLLM++ squeezes extra speed out of diffusion large language models.

Fast-dLLM++ replaces the original Fast-dLLM decoder with a "Fréchet profile decoding" rule. Instead of deciding how many tokens to commit in parallel from the single worst‑case confidence, it scans the whole sorted confidence profile and picks a set that accounts for how confidence varies from token to token. The change needs no model retraining, no alteration to the diffusion process, and no cache redesign. It drops in wherever Fast-dLLM already runs. Benchmarks on GSM8K, MATH, HumanEval and MBPP with the LLaDA‑8B model show up to 37% higher throughput at comparable accuracy.
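
The exact selection rule is not spelled out here, so the following is only a hypothetical sketch of the general idea, not the Fast-dLLM++ algorithm. It contrasts an assumed worst-case rule (charge every candidate token the deficit of the least confident one) with an assumed profile rule based on the Fréchet lower bound, P(all tokens correct) >= 1 - sum of (1 - c_i). The function names, the `budget` parameter and the sample confidences are all illustrative assumptions.

```python
from typing import List


def _rank_by_confidence(confidences: List[float]) -> List[int]:
    """Return token positions ordered from most to least confident."""
    return sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)


def worst_case_commit(confidences: List[float], budget: float) -> List[int]:
    """Assumed worst-case rule: commit the top k tokens while
    k * (1 - lowest confidence among them) stays within the budget."""
    order = _rank_by_confidence(confidences)
    chosen = order[:1]  # always commit the single most confident token
    for k in range(2, len(order) + 1):
        worst_deficit = 1.0 - confidences[order[k - 1]]
        if k * worst_deficit > budget:
            break
        chosen = order[:k]
    return chosen


def profile_commit(confidences: List[float], budget: float) -> List[int]:
    """Assumed profile rule: commit the top k tokens while the summed
    deficits sum(1 - c_i) stay within the budget (Fréchet lower bound)."""
    order = _rank_by_confidence(confidences)
    chosen = order[:1]  # always commit the single most confident token
    total_deficit = 1.0 - confidences[order[0]] if order else 0.0
    for k in range(2, len(order) + 1):
        total_deficit += 1.0 - confidences[order[k - 1]]
        if total_deficit > budget:
            break
        chosen = order[:k]
    return chosen


# Illustrative per-position confidences for five masked tokens.
confidences = [0.99, 0.98, 0.97, 0.60, 0.95]
print(worst_case_commit(confidences, budget=0.15))  # [0, 1, 2]
print(profile_commit(confidences, budget=0.15))     # [0, 1, 2, 4]
```

In this toy case the worst-case rule stops at three tokens because the fourth candidate's deficit is multiplied by four, while the profile rule admits it because the three stronger tokens contribute very little to the sum. Since the summed deficits never exceed k times the largest one, the profile rule commits at least as many tokens as the worst-case rule for the same budget.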

The gain matters because diffusion LLMs have long been stuck on a serial bottleneck despite their parallel generation promise. By harvesting safely parallelizable tokens that were previously blocked by a conservative rule, practitioners can run larger workloads on existing hardware.

In short, a speedup of up to 37% narrows the gap between diffusion LLM theory and practice. Because the method is a drop-in change, adoption could be quick, and it leaves room for researchers to probe further confidence‑aware decoding tricks.


The Revision

Written by an AI system from the public sources credited above.