A Smarter Way to Train LLMs to Reason

A new training framework targets the specific tokens that matter most during LLM reasoning and ignores the rest.

Researchers introduced the Independent Combinatorial Tokens (ICT) framework to address a known tension in reinforcement learning for language models. Standard approaches either update all tokens equally, which causes the model to over-commit to early strategies, or push too hard on entropy maximization, which sends exploration off the rails into incoherent outputs. ICT sidesteps this by using Jensen-Shannon divergence to measure how far each token's probability distribution departs from the baseline, then flagging the outliers as key decision points worth training on. Only the top 10% of these distinctive tokens get updated.
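To make the selection step concrete, here is a minimal sketch of divergence-based token selection in plain Python. It is an illustration under stated assumptions, not the researchers' implementation: the article does not say what the "baseline" distribution is or exactly how the top 10% is drawn, so this sketch assumes one reference distribution per token position and keeps the 10% of positions with the highest divergence. The names `js_divergence`, `select_key_tokens`, and `top_fraction` are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    # Mixture distribution; it is nonzero wherever p or q is nonzero.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # KL(x || y), skipping zero-probability entries of x.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def select_key_tokens(policy_dists, baseline_dists, top_fraction=0.10):
    """Return (mask, scores); mask[i] is True for positions chosen for updates.

    Assumption: both arguments hold one next-token distribution per position,
    and selection keeps the highest-divergence fraction of all positions.
    """
    scores = [js_divergence(p, q) for p, q in zip(policy_dists, baseline_dists)]
    k = max(1, round(top_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(scores))], scores


# Toy example: ten positions over a four-word vocabulary; only position 3 differs.
base = [0.25, 0.25, 0.25, 0.25]
policy = [base] * 10
policy[3] = [0.7, 0.1, 0.1, 0.1]
mask, scores = select_key_tokens(policy, [base] * 10)
```

In a training loop, a mask like this would presumably gate which token positions contribute to the loss, leaving the other 90% untouched.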

The practical result matters because reasoning benchmarks are where frontier labs currently compete hardest. Testing on Qwen2.5 models at three sizes (0.5B, 1.5B, and 7B parameters), ICT delivered an average pass@4 improvement of 4.58% over existing baselines (GRPO, 20-Entropy, and STAPO) across seven benchmarks covering math, commonsense, and Olympiad-level problems, with a peak gain of 14.9% on individual tests. That's a meaningful jump without changing the model architecture at all.
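For readers unfamiliar with the metric, pass@4 is the probability that at least one of four sampled answers is correct. The article does not say how the researchers computed it; the sketch below shows one widely used unbiased estimator, which draws n samples per problem, counts the c correct ones, and computes 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` is an illustrative choice.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k wrong samples exist, so any k drawn must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 8 samples for one problem, 2 of them correct, scored at k = 4.
score = pass_at_k(n=8, c=2, k=4)
```

A benchmark score is then the average of this value over all problems.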

The broader implication is that selective gradient updates, not brute-force reinforcement sweeps, may be the more efficient path to better reasoning. Whether the gains hold at the 70B-plus scale that actually ships in products remains the open question.
