A new pipeline can extract readable, named skills from AI agent interaction logs — but those skills don't actually make agents any better at new tasks.
Researchers built a three-stage system that takes GUI trajectories (the click-by-click recordings of agents completing tasks), breaks them into segments, clusters those segments into candidate skills, and then trains a new agent policy on the resulting annotations. The mined clusters look coherent: five of eight clusters scored at least 0.95 purity against InteraSkill Workflows labels. The team trained the agent with GRPO, a reinforcement learning method, on those skill annotations and tested it against two benchmarks.
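For readers unfamiliar with the purity figure, here is a minimal sketch of how per-cluster purity is commonly computed: the share of a cluster's members that carry its most common reference label. The function name `cluster_purity` and the toy data are made up for illustration, and the paper may define purity differently.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Return {cluster_id: purity}.

    Purity here is the fraction of a cluster's members that share the
    cluster's most common reference label (an assumed, standard definition).
    """
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, labels):
        members[cid].append(label)
    return {
        cid: Counter(lbls).most_common(1)[0][1] / len(lbls)
        for cid, lbls in members.items()
    }


# Toy data, invented for this example: two clusters of four segments each.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
labels = ["login", "login", "login", "login",
          "search", "search", "search", "filter"]

purity = cluster_purity(cluster_ids, labels)
# Count clusters that clear a 0.95 threshold, mirroring the article's cutoff.
high_purity = sum(1 for p in purity.values() if p >= 0.95)
```

In this toy case the first cluster is perfectly pure and the second is not, so only one of the two clears the 0.95 bar. A high score says the cluster lines up with human labels; it says nothing about whether a policy can learn from it.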
The paper is candid about the limits. GRPO improved skill-step accuracy on InteraSkill Workflows from 18.5% to 20.5%, a two-percentage-point gain that barely beats a simple frequency prior, and left BrowseComp+ scores essentially unchanged. The paper's core finding is that readable structure in mined data does not guarantee that the structure is the right one for policy learning.
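A frequency prior is about the simplest baseline available: always predict whichever skill appeared most often in training. The sketch below shows one plausible version, assuming the prior is a single most-frequent label applied to every step. The function name and the toy labels are hypothetical, and the paper's exact baseline may differ.

```python
from collections import Counter


def frequency_prior_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


# Toy skill labels, invented for this example.
train = ["click", "click", "type", "scroll", "click"]
test = ["click", "type", "click", "scroll", "scroll"]

baseline = frequency_prior_accuracy(train, test)
```

A trained policy that lands only slightly above this number has learned little beyond which skill is most common.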
The researchers call this a diagnostic study rather than a system to deploy. That is either unusual modesty in an era of AI benchmark racing or a sign that the gap between "a human can read this skill library" and "an agent can learn from it" is wider than the field has assumed.