Actor-critic data mixing cuts LLM pre-training steps by up to two-thirds

A new reinforcement‑learning approach to data selection speeds up LLM training while modestly increasing compute load.

  • AC-ODM, a reinforcement‑learning‑driven data mixer, trims the number of training steps needed for large language models.

The authors frame data composition as a policy problem and train an actor‑critic model to weight training samples dynamically. Two modes are offered: a proxy mode that learns the policy on a small model and transfers it to a larger target, and a non‑proxy mode that learns the policy from scratch on the target model itself. With Pythia‑1B as the target model, the method reaches optimal validation perplexity in up to 66% fewer steps than prior data mixers. Reported gains include a 27.5% lift in MMLU accuracy and a 2.23× increase in HumanEval pass@1, while wall‑clock time per step rises only 0.4% and memory use climbs 2%.
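To make the "data composition as a policy problem" framing concrete, here is a minimal sketch of an actor‑critic mixer in a deliberately simplified setting. It is not the authors' implementation: the three data domains, the simulated reward (a noisy per‑domain "loss improvement"), the function name `train_mixer`, and all hyperparameters are assumptions made for illustration. The actor is a softmax over domain logits, and the critic is reduced to a single running estimate of the expected reward.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into sampling probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_mixer(mean_gain, steps=3000, actor_lr=0.1, critic_lr=0.1, seed=0):
    """Toy actor-critic loop that learns sampling weights over data domains.

    mean_gain[j] is a hypothetical average reward for drawing a batch from
    domain j. A real mixer would measure its reward on the model being trained.
    """
    rng = random.Random(seed)
    k = len(mean_gain)
    logits = [0.0] * k  # actor parameters: one logit per domain
    value = 0.0         # critic: running estimate of the expected reward

    for _ in range(steps):
        probs = softmax(logits)
        # Actor picks the domain that supplies the next batch.
        domain = rng.choices(range(k), weights=probs)[0]
        # Simulated reward (assumption): noisy gain from training on that domain.
        reward = mean_gain[domain] + rng.gauss(0.0, 0.05)
        # Advantage: how much better the reward was than the critic expected.
        advantage = reward - value
        # Policy-gradient step for a softmax policy.
        for j in range(k):
            indicator = 1.0 if j == domain else 0.0
            logits[j] += actor_lr * advantage * (indicator - probs[j])
        # Critic moves its estimate toward the observed reward.
        value += critic_lr * advantage

    return softmax(logits)


# Hypothetical domains: index 1 is the most useful for the model.
weights = train_mixer([0.1, 0.5, 0.2])
print([round(w, 3) for w in weights])
```

In this toy run the mixer shifts most of its sampling weight toward the domain with the highest simulated gain. A proxy mode, in the article's terms, would run a loop like this against a small model and reuse the learned policy when training the larger target.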

If data selection truly dominates pre‑training efficiency, shrinking step counts translates into lower energy bills and faster model roll‑outs. The approach also promises plug‑and‑play flexibility across pipelines, a rare trait among recent adaptive‑mixing schemes.
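The reported figures can be combined into a rough estimate of net wall‑clock savings. The sketch below assumes that the 66% step reduction (reported as an upper bound) and the 0.4% per‑step slowdown apply together on the same hardware, and it ignores any cost of training a proxy model, which the summary above does not quantify.

```python
# Reported figures, taken at face value.
step_reduction = 0.66      # up to 66% fewer training steps
per_step_overhead = 0.004  # each step takes 0.4% longer

# Wall-clock time relative to a baseline mixer (baseline = 1.0).
relative_time = (1 - step_reduction) * (1 + per_step_overhead)
net_savings = 1 - relative_time

print(f"relative wall-clock time: {relative_time:.3f}")
print(f"net savings: {net_savings:.1%}")
```

Under these assumptions the best‑case run takes about 34% of the baseline wall‑clock time, so the per‑step overhead barely dents the savings from needing fewer steps.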

In sum, AC‑ODM delivers faster convergence and better downstream performance with minimal overhead, but its gains have so far been demonstrated on a single 1B‑parameter model; broader scaling tests will determine whether the benefits hold for today’s multi‑billion‑parameter LLMs.

The Revision

Written by an AI system from the public sources credited above.