Vocabulary dropout steadies LLM co-evolution curricula

Randomly masking the proposer’s token logits keeps problem generation diverse and boosts solver performance by about 4 points.

Vocabulary dropout stops co‑evolutionary curricula from collapsing.

Researchers set up self‑play between two language models: a proposer that writes problems and a solver that answers them. In early runs the proposer quickly collapsed onto a narrow set of token patterns that satisfied its reward, leaving the solver with little new material. The team applied a hard, non‑stationary mask to the proposer’s output logits, called vocabulary dropout, during both training and problem generation. In tests on Qwen3‑4B and Qwen3‑8B trained on mathematical reasoning, the mask preserved lexical, semantic and functional diversity throughout training, and the solver gained an average of 4.4 points on benchmark scores, with the largest gains on competition‑level tests.
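To make the idea concrete, here is a minimal sketch of what a hard, resampled mask on output logits could look like. It is an illustration under stated assumptions, not the researchers' implementation: the function names, the 30% drop rate, the per-step resampling schedule and the `always_keep` escape hatch are all hypothetical choices made for the example.

```python
import math
import random

def sample_vocab_mask(vocab_size, drop_rate, rng, always_keep=()):
    """Draw a fresh keep/drop decision for every token id.

    Hypothetical helper: drop_rate and always_keep are illustrative
    parameters, not values taken from the research.
    """
    keep = [rng.random() >= drop_rate for _ in range(vocab_size)]
    for token_id in always_keep:
        keep[token_id] = True
    # Guard: never drop the whole vocabulary, or softmax is undefined.
    if not any(keep):
        keep[rng.randrange(vocab_size)] = True
    return keep

def apply_vocab_mask(logits, keep):
    """Hard mask: dropped tokens get -inf, so they receive probability 0."""
    return [x if k else -math.inf for x, k in zip(logits, keep)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 1.0, 0.5, 0.0, -1.0, 3.0]  # toy 6-token vocabulary
    # Resampling the mask at each step is what makes it non-stationary
    # (the resampling schedule here is an assumption).
    for step in range(3):
        keep = sample_vocab_mask(len(logits), 0.3, rng, always_keep=(0,))
        probs = softmax(apply_vocab_mask(logits, keep))
        print(step, keep, [round(p, 3) for p in probs])
```

Because the mask sets dropped logits to negative infinity rather than scaling them down, the affected tokens cannot be sampled at all while the mask is active, which is the sense in which the mask is "hard".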

The finding matters because the mask supplies the kind of external constraint that fixed game rules provide in classic self‑play systems like AlphaZero. Because the mask limits the proposer’s action space and keeps changing, the proposer cannot lock onto a single exploit and is pushed to explore new problem families that keep the solver improving. This simple tweak could become a standard tool for other co‑evolutionary setups, from code generation to scientific discovery, where curriculum diversity is critical.

In short, a modest random mask re‑introduces the exploratory pressure that self‑play needs, offering a low‑cost antidote to diversity collapse.

The Revision

Written by an AI system from the public sources credited above.