Picking the wrong practice problems costs large language models performance during reinforcement learning.
Researchers introduced a framework called Bayesian Manifold Curriculum (BMC) that treats problem selection as something more than a difficulty dial. Instead of ranking prompts by how hard they are and feeding models a steady diet of medium-hard questions, BMC maps problems onto the model's own internal representation space — a geometric structure the authors call a manifold — and uses that map to build a hierarchical task tree. Bayesian learning then guides which problems get sampled next. The key empirical finding: difficulty-first sampling forces a tradeoff between productivity (how strong the learning signal is), diversity (how broadly the training covers the problem space), and utility (how well any of it transfers to evaluation benchmarks).
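To make the tree-guided idea concrete, here is a minimal sketch, and not the paper's algorithm. It assumes a hand-written two-level task tree (BMC derives its tree from the model's representation space, which is not reproduced here), a Beta posterior at every node, and a binary "useful learning signal" reward. The names `Node`, `build_toy_tree`, `sample_path`, `update_path`, `simulate_tree`, and the cluster labels are all hypothetical.

```python
import random


class Node:
    """A cluster of related problems with a Beta posterior over its usefulness."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.alpha = 1.0  # 1 + count of useful outcomes seen below this node
        self.beta = 1.0   # 1 + count of non-useful outcomes seen below this node


def build_toy_tree():
    # Hypothetical hand-written tree; stands in for one derived from embeddings.
    algebra = Node("algebra", [Node("linear"), Node("quadratic")])
    geometry = Node("geometry", [Node("triangles"), Node("circles")])
    return Node("root", [algebra, geometry])


def sample_path(root, rng):
    """Thompson-sample one child per level until a leaf cluster is reached."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda c: rng.betavariate(c.alpha, c.beta))
        path.append(node)
    return path


def update_path(path, useful):
    """Credit the outcome to every node on the path, so siblings share evidence."""
    for node in path:
        if useful:
            node.alpha += 1.0
        else:
            node.beta += 1.0


def simulate_tree(leaf_rates, steps=1500, seed=0):
    """leaf_rates: assumed probability that a leaf's problems yield a useful signal."""
    rng = random.Random(seed)
    root = build_toy_tree()
    visits = {name: 0 for name in leaf_rates}
    for _ in range(steps):
        path = sample_path(root, rng)
        leaf = path[-1]
        useful = rng.random() < leaf_rates[leaf.name]
        update_path(path[1:], useful)  # the root has no siblings, so skip it
        visits[leaf.name] += 1
    return visits


if __name__ == "__main__":
    rates = {"linear": 0.1, "quadratic": 0.2, "triangles": 0.8, "circles": 0.3}
    print(simulate_tree(rates))
```

Because each outcome updates the whole path, one result from a "triangles" problem also shifts how often the sampler visits the rest of "geometry". That shared evidence is the structural point. The reward definition and tree shape here are placeholders.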
The standard approach treats each training problem as an independent arm of a multi-armed bandit and pulls whichever arm looks hard but not too hard. That ignores the fact that problems are related to each other through what the model already knows. BMC exploits that structure, which means the training signal can be steered deliberately rather than discovered accidentally. For anyone building reasoning-focused models, that distinction could matter at scale, where wasted compute compounds quickly.
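For contrast, the difficulty-first baseline described above can be sketched in a few lines. This is an illustrative assumption about how such samplers are commonly built, not code from the paper: each prompt gets its own Beta posterior over the model's success rate, and the sampler picks the prompt whose Thompson draw lands closest to a target rate of 0.5. The names `DifficultyBandit` and `simulate` and the target value are hypothetical.

```python
import random


class DifficultyBandit:
    """One independent Beta posterior per prompt; no information is shared."""

    def __init__(self, n_prompts, target=0.5, seed=0):
        self.alpha = [1.0] * n_prompts  # 1 + observed successes per prompt
        self.beta = [1.0] * n_prompts   # 1 + observed failures per prompt
        self.target = target            # assumed "medium-hard" success rate
        self.rng = random.Random(seed)

    def select(self):
        # Thompson step: draw a plausible success rate for every prompt,
        # then pick the prompt whose draw is closest to the target.
        draws = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return min(range(len(draws)), key=lambda i: abs(draws[i] - self.target))

    def update(self, prompt, solved):
        # Only the pulled arm changes; related prompts learn nothing from this.
        if solved:
            self.alpha[prompt] += 1.0
        else:
            self.beta[prompt] += 1.0


def simulate(true_rates, steps=2000, seed=0):
    """Run the sampler against fixed, assumed per-prompt success rates."""
    bandit = DifficultyBandit(len(true_rates), seed=seed)
    env = random.Random(seed + 1)
    counts = [0] * len(true_rates)
    for _ in range(steps):
        i = bandit.select()
        bandit.update(i, env.random() < true_rates[i])
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Prompt 0 is nearly unsolvable, prompt 1 is medium, prompt 2 is nearly trivial.
    print(simulate([0.05, 0.5, 0.95]))
```

The sampler concentrates on the medium-difficulty prompt, but the `update` method shows the limitation: an outcome on one prompt never informs the estimate for any other, however similar the two problems are.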
Difficulty-based curriculum methods are common in RL-for-reasoning pipelines, but this paper adds to a growing body of work suggesting that the type and coverage of training problems matter as much as their rank on a hardness scale. That finding should make labs rethink what "optimal sampling" actually means.