- ISE, a new dataset and generation framework, raises pass@1 on the ClawEval tool-use benchmark from 19.3% to 37.7%.
What actually happened: Researchers built a three-stage pipeline (Intent, Simulate, Execute) to produce realistic OS-agent trajectories. Stage 1 generated 43,956 unique, structured intents spanning personas, domains, tasks and complexity levels. Stage 2 ran a role-locked user simulator that grounded each user turn in actual execution outcomes, yielding 23,132 complete multi-turn dialogues averaging 8.12 user turns each. Stage 3 executed every tool call in an isolated OS workspace, capturing real failure-and-recovery dynamics. Fine-tuning Qwen3-8B on the resulting ISETrace set lifted its ClawEval pass@1 to 37.7%, beating zero-shot GPT-4o and even the Qwen3-32B base model, which is four times larger.
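To make the three stages concrete, here is a minimal toy sketch of how they might fit together. It is not the authors' code: every name in it (`Intent`, `Turn`, `Trajectory`, `simulate_user`, `execute`, `run_episode`) is hypothetical, the "tools" are two made-up file operations, and a temporary directory stands in for the isolated OS workspace. The one idea it tries to show is that each simulated user turn is conditioned on what the previous tool call really returned.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Intent:
    # Hypothetical fields mirroring the four axes named in the text (Stage 1).
    persona: str
    domain: str
    task: str
    complexity: int


@dataclass
class Turn:
    user: str
    tool_call: str
    result: str


@dataclass
class Trajectory:
    intent: Intent
    turns: list = field(default_factory=list)


def execute(tool_call: str, workspace: Path) -> str:
    """Stage 3 stand-in: run a toy 'tool' inside an isolated directory."""
    op, _, name = tool_call.partition(" ")
    target = workspace / name
    if op == "read":
        if not target.exists():
            return f"error: {name} not found"
        return target.read_text()
    if op == "create":
        target.write_text("draft")
        return f"ok: created {name}"
    return f"error: unknown tool {op}"


def simulate_user(intent: Intent, last_result: Optional[str]) -> str:
    """Stage 2 stand-in: the next user turn depends on the real last result."""
    if last_result is None:
        return f"Please {intent.task}."
    if last_result.startswith("error"):
        return "That failed. Can you create the file first and retry?"
    return "Thanks, now show me the contents."


def run_episode(intent: Intent, tool_calls: list) -> Trajectory:
    """Interleave simulated user turns with real execution in a temp workspace."""
    traj = Trajectory(intent)
    last = None
    with tempfile.TemporaryDirectory() as tmp:
        for call in tool_calls:
            user = simulate_user(intent, last)
            last = execute(call, Path(tmp))
            traj.turns.append(Turn(user, call, last))
    return traj


intent = Intent("student", "file management", "open notes.txt", 1)
traj = run_episode(intent, ["read notes.txt", "create notes.txt", "read notes.txt"])
for t in traj.turns:
    print(f"user: {t.user}\n  call: {t.tool_call} -> {t.result}")
```

In this toy run the first read fails because the file does not exist, the simulated user reacts to that failure, and the trajectory records the recovery. That is the kind of failure-and-recovery signal the article credits to grounding dialogue in real execution.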
Why it matters: Existing agent datasets lack the combination of structured intent, turn-by-turn interaction and genuine execution feedback that real-world assistants need. By combining these elements, ISE supplies that missing training signal, and the performance jump suggests that data fidelity can matter more than model size alone. The paper's ablation attributes the bulk of the gain to the multi-turn simulation, suggesting future work should focus on interaction realism.
The release includes all code and data, lowering the barrier for other labs to replicate or extend the approach. Other groups now have a dataset grounded in actual OS constraints, not just synthetic replies.
In short, ISE demonstrates that carefully engineered data pipelines can narrow the gap between lab benchmarks and real-world agent competence without resorting to ever-larger models.