SuperThoughts lets large language models emit two reasoning tokens at once.
The authors fine‑tuned four Qwen2.5‑Math instruction models with a lightweight Multi‑Token Prediction (MTP) module that packs each pair of consecutive chain‑of‑thought (CoT) tokens into a single latent vector. At inference time the model decodes two tokens per step, cutting the effective CoT length by roughly 20‑30 %. A confidence‑based fallback reverts to standard decoding when the MTP signal is weak. Tests on MATH500, AMC, OlympiadBench, and GPQA‑Diamond show a 1‑2 point drop in accuracy on most benchmarks.
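The confidence-based fallback can be pictured as a simple decoding loop. The following is a minimal sketch of the control flow only, not the authors' implementation: the `PairPredictor` interface, the `threshold` value of 0.8, and the scripted toy predictor are all assumptions made for illustration, and the sketch does not model how the MTP module packs a token pair into a latent vector.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (not from the paper): given the context so far,
# return (next_token, proposed_second_token, confidence_in_second_token).
PairPredictor = Callable[[List[str]], Tuple[str, str, float]]

EOS = "<eos>"


def decode_with_fallback(predict_pair, prompt, max_tokens=64, threshold=0.8):
    """Greedy decoding that emits two tokens per step when confidence allows.

    Returns the generated tokens and the number of decoding steps used.
    """
    context = list(prompt)
    steps = 0
    while len(context) - len(prompt) < max_tokens:
        first, second, conf = predict_pair(context)
        steps += 1
        context.append(first)
        if first == EOS:
            break
        # Fallback: keep only the first token when the pair signal is weak.
        if conf >= threshold and len(context) - len(prompt) < max_tokens:
            context.append(second)
            if second == EOS:
                break
    return context[len(prompt):], steps


def make_toy_predictor(target, pair_conf, prompt_len):
    """Scripted stand-in for a model: replays `target` with fixed confidences."""
    def predict(context):
        i = len(context) - prompt_len
        first = target[i]
        second = target[i + 1] if i + 1 < len(target) else EOS
        return first, second, pair_conf[i]
    return predict


prompt = ["Q"]
target = ["2", "+", "2", "=", "4", EOS]
pair_conf = [0.9, 0.3, 0.4, 0.2, 0.9, 0.0]  # assumed per-position confidences
predictor = make_toy_predictor(target, pair_conf, len(prompt))

tokens, steps = decode_with_fallback(predictor, prompt)
print(tokens, steps)  # 6 tokens in 4 steps for this scripted input
```

With the scripted confidences above, two of the four steps emit a pair and two fall back to a single token. Raising `threshold` above 1.0 disables pairing entirely and recovers ordinary one-token-per-step decoding.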
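Emitting up to two tokens per decoding step shortens generation, which can reduce compute time and cloud cost for long‑form reasoning, a known bottleneck for LLMs tackling math or logic problems. Because the fallback sometimes reverts to one token per step, the saving is the roughly 20‑30 % reported above rather than a full halving. The approach keeps discrete token supervision, sidestepping the instability that plagues fully latent‑space reasoning methods.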
It is a modest speed‑up rather than a breakthrough; the gain depends on the model’s ability to predict paired tokens accurately.
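A back-of-the-envelope calculation shows how the saving depends on how often paired prediction succeeds. This is a sketch under a simplifying assumption that is not taken from the paper: every decoding step emits either one or two tokens, and all steps cost the same. The function names are illustrative.

```python
def step_reduction(pair_fraction: float) -> float:
    """Fraction of decoding steps saved when `pair_fraction` of steps emit two tokens."""
    # N tokens in S steps satisfy S * (1 + p) = N, so S / N = 1 / (1 + p)
    # and the saving relative to N single-token steps is p / (1 + p).
    return pair_fraction / (1.0 + pair_fraction)


def pair_fraction_needed(reduction: float) -> float:
    """Inverse: share of steps that must emit a pair to save `reduction` of the steps."""
    return reduction / (1.0 - reduction)


for r in (0.20, 0.30):
    p = pair_fraction_needed(r)
    print(f"{r:.0%} fewer steps needs pairs on about {p:.0%} of steps")
```

Under this simplified model, a 20‑30 % cut in steps corresponds to pairs being accepted on roughly 25 % to 43 % of decoding steps, and a full halving is reached only if every step emits a pair.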
In short, SuperThoughts offers a practical way to shave latency from heavy CoT workloads, but the 1‑2 point accuracy dip and the need for a reliable confidence check mean it remains a trade‑off rather than a universal solution.