- AC-ODM, a reinforcement‑learning‑driven data mixer, cuts the number of training steps needed to pre‑train large language models.
The authors frame data composition as a policy problem and train an actor‑critic model to weight training samples dynamically. Two modes are offered: a proxy mode that learns on a small model and transfers the policy to a larger target, and a non‑proxy mode that learns from scratch. On the Pythia‑1B benchmark the method reaches optimal validation perplexity with up to 66% fewer steps than prior mixers. Reported gains include a 27.5% lift in MMLU accuracy and a 2.23× increase in HumanEval pass@1, while wall‑clock time per step rises only 0.4% and memory use climbs 2%.
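The policy framing can be illustrated with a toy example. What follows is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a softmax actor over three hypothetical data domains, a scalar critic used as a baseline, and made-up reward values standing in for whatever training signal the real system uses. The names `ActorCriticMixer` and `simulate`, the learning rates, and the reward numbers are all invented for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


class ActorCriticMixer:
    """Toy actor-critic policy over K data domains (hypothetical, not the paper's model)."""

    def __init__(self, num_domains, actor_lr=0.1, critic_lr=0.1, seed=0):
        self.logits = np.zeros(num_domains)  # actor: one logit per domain
        self.value = 0.0                     # critic: running estimate of expected reward
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.rng = np.random.default_rng(seed)

    def probs(self):
        """Current sampling distribution over domains."""
        return softmax(self.logits)

    def sample_domain(self):
        """Pick the domain that supplies the next training batch."""
        return int(self.rng.choice(len(self.logits), p=self.probs()))

    def update(self, domain, reward):
        """One actor-critic step: the critic's estimate serves as the baseline."""
        advantage = reward - self.value
        # Gradient of log softmax probability of the chosen domain w.r.t. the logits.
        grad_log_pi = -self.probs()
        grad_log_pi[domain] += 1.0
        self.logits += self.actor_lr * advantage * grad_log_pi
        self.value += self.critic_lr * advantage
        return advantage


def simulate(steps=500, seed=0):
    """Stand-in for a training loop: each domain yields a noisy, made-up reward."""
    rng = np.random.default_rng(seed)
    mean_reward = np.array([0.1, 0.5, 1.0])  # assumed values, for illustration only
    mixer = ActorCriticMixer(num_domains=3, seed=seed)
    for _ in range(steps):
        d = mixer.sample_domain()
        r = mean_reward[d] + 0.05 * rng.normal()
        mixer.update(d, r)
    return mixer
```

Calling `simulate().probs()` returns the learned sampling distribution; in this toy setup the probability mass drifts toward the domain with the highest assumed reward. In a real pipeline the reward would come from the model being trained, and the actor and critic would likely be small networks conditioned on training state rather than a bare logit vector and a scalar.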
If data selection truly dominates pre‑training efficiency, shrinking step counts translates into lower energy bills and faster model roll‑outs. The approach also promises plug‑and‑play flexibility across pipelines, a rare trait among recent adaptive‑mixing schemes.
In sum, AC‑ODM delivers faster convergence and better downstream performance with minimal overhead. Its gains, however, are demonstrated on a single 1B‑parameter model; broader scaling tests will determine whether the benefits hold for today’s multi‑billion‑parameter LLMs.