[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"branding":3,"analytics":7,"article-ise-dataset-lifts-toolusing-agent-pass1-to-377":10},{"siteName":4,"siteTagline":5,"publisherName":4,"contactEmail":6},"The Revision","Tech news, decoded.","editor@therevision.news",{"gaMeasurementId":8,"adsenseClientId":9},"G-ZW2MV82GYR","ca-pub-8533917693782264",{"article":11},{"id":12,"slug":13,"title":14,"dek":15,"body_md":16,"tags_json":17,"published_at":18,"created_at":19,"updated_at":20,"status":21,"review_note":22,"review_notes":23,"image_url":22,"persona_id":22,"persona_name":22,"section":22,"tags":30,"sources":34,"feedback":38,"feedback_at":22,"cost_usd":38,"total_tokens":38},1423,"ise-dataset-lifts-toolusing-agent-pass1-to-377","ISE dataset lifts tool‑using agent pass@1 to 37.7%","A three‑stage synthesis pipeline creates realistic multi‑turn OS‑agent data, nearly doubling pass@1 scores for an 8‑billion‑parameter model.","- ISE, a new dataset and generation framework, raises pass@1 on tool‑use benchmarks from 19.3 % to 37.7 %.\n\nWhat actually happened: Researchers built a three‑stage pipeline (Intent, Simulate, Execute) to produce realistic OS‑agent trajectories. Stage 1 generated 43,956 unique, structured intents across personas, domains, tasks and complexity levels. Stage 2 ran a role‑locked user simulator that anchored each user turn in actual execution outcomes, yielding 23,132 complete, multi‑turn dialogues with an average of 8.12 user turns. Stage 3 executed every tool call in an isolated OS workspace, capturing real failure‑recovery dynamics. Fine‑tuning Qwen3‑8B on the resulting ISETrace set lifted its ClawEval pass@1 to 37.7 %, beating zero‑shot GPT‑4o and even a four‑times larger Qwen3‑32B base model.\n\nWhy it matters: Existing agent datasets lack the blend of intent structure, turn‑by‑turn interaction and genuine execution feedback that real‑world assistants need. 
By stitching these elements together, ISE provides the missing training signal, and the performance jump shows that fidelity matters more than model size alone. The ablation confirms that the multi‑turn simulation contributes the bulk of the gain, suggesting future work should focus on interaction realism.\n\nThe release also includes all code and data, lowering the barrier for other labs to replicate or extend the approach. Competitors will now have a benchmark that reflects actual OS constraints, not just synthetic replies.\n\nIn short, ISE demonstrates that carefully engineered data pipelines can narrow the gap between lab benchmarks and real‑world agent competence, without resorting to ever larger models.","[\"agent-training\",\"datasets\",\"machine-learning\"]","2026-06-16T04:00:00.000Z","2026-06-17T09:03:38.308Z","2026-06-17T09:03:41.132Z","published",null,[24],{"id":25,"reviewer":26,"round":27,"reason":28,"status":29},"editor-r1","editor",1,"Add a concise concluding paragraph summarising the news and its implications, as the draft currently ends abruptly without a clear wrap‑up.","resolved",[31,32,33],"agent-training","datasets","machine-learning",[35],{"name":36,"url":37},"arXiv cs.AI","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.11520",0]