[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"branding":3,"analytics":7,"article-actor-critic-data-mixing-cuts-llm-pre-training-steps-by-two-thirds":10},{"siteName":4,"siteTagline":5,"publisherName":4,"contactEmail":6},"The Revision","Tech news, decoded.","editor@therevision.news",{"gaMeasurementId":8,"adsenseClientId":9},"G-ZW2MV82GYR","ca-pub-8533917693782264",{"article":11},{"id":12,"slug":13,"title":14,"dek":15,"body_md":16,"tags_json":17,"published_at":18,"created_at":19,"updated_at":20,"status":21,"review_note":22,"review_notes":23,"image_url":22,"persona_id":22,"persona_name":22,"section":22,"tags":39,"sources":43,"feedback":47,"feedback_at":22,"cost_usd":47,"total_tokens":47},1315,"actor-critic-data-mixing-cuts-llm-pre-training-steps-by-two-thirds","Actor-Critic data mixing cuts LLM pre-training steps by two-thirds","A new reinforcement‑learning approach to data selection speeds up LLM training while modestly increasing compute load.","- AC-ODM, a reinforcement‑learning‑driven data mixer, trims the number of training steps needed for large language models.\n\nThe authors frame data composition as a policy problem and train an actor‑critic model to weight training samples dynamically. Two modes are offered: a proxy mode that learns on a small model and transfers the policy to a larger target, and a non‑proxy mode that learns from scratch. On the Pythia‑1B model, the method reaches optimal validation perplexity with up to 66% fewer steps than prior mixers. Reported gains include a 27.5% lift in MMLU accuracy and a 2.23× increase in HumanEval pass@1, while wall‑clock time per step rises only 0.4% and memory use climbs 2%.\n\nIf data selection truly dominates pre‑training efficiency, shrinking step counts translates into lower energy bills and faster model roll‑outs. 
The approach also promises plug‑and‑play flexibility across pipelines, a rare trait among recent adaptive‑mixing schemes.\n\nIn sum, AC‑ODM delivers faster convergence and better downstream performance with minimal overhead. Its gains, however, are demonstrated only on a single 1B‑parameter model; broader scaling tests will determine whether the benefits hold for today’s multi‑billion‑parameter LLMs.","[\"llm\",\"pretraining\",\"reinforcement-learning\"]","2026-06-16T04:00:00.000Z","2026-06-17T03:29:46.357Z","2026-06-17T03:29:49.164Z","published",null,[24,30,35],{"id":25,"reviewer":26,"round":27,"reason":28,"status":29},"publisher-r1","publisher",1,"The body is overly brief and reads like a fragment, lacking a full narrative flow and proper concluding paragraph.","resolved",{"id":31,"reviewer":32,"round":33,"reason":34,"status":29},"editor-r2","editor",2,"The piece ends abruptly without a clear concluding summary; add a final paragraph that restates the news and its implications.",{"id":36,"reviewer":32,"round":37,"reason":38,"status":29},"editor-r3",3,"Add a clear concluding paragraph that restates the key findings, their significance, and any caveats, providing a proper summary to close the article.",[40,41,42],"llm","pretraining","reinforcement-learning",[44],{"name":45,"url":46},"arXiv cs.AI","https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23878",0]