- SkillsBench measures how procedural skill packages affect LLM agents across 87 tasks in eight domains.
The researchers ran each task twice, once with no added skills and once with a curated set of skill modules, across 18 model-harness configurations. Without skills, the average pass rate was 33.9%. With curated skills it climbed to 50.5%, a lift of 16.6 percentage points, or a 25.5% normalized gain. Gains varied by configuration, from 4.1 to 25.7 percentage points. Notably, compact skill bundles of three modules outperformed larger, exhaustive collections, and a small model equipped with skills matched the performance of a larger model lacking them.
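The two headline figures can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes "normalized gain" means the share of remaining headroom (100 minus the baseline pass rate) that the skills recovered, a definition the summary above does not spell out. Under that assumption the aggregate averages give roughly 25.1%, close to but not identical to the reported 25.5%; one possible explanation is that the reported figure is averaged per configuration rather than computed from the overall averages.

```python
def absolute_lift(baseline: float, with_skills: float) -> float:
    """Percentage-point difference between two pass rates given in percent."""
    return with_skills - baseline


def normalized_gain(baseline: float, with_skills: float) -> float:
    """Share of the remaining headroom (100 - baseline) recovered, in percent.

    Assumed definition; the benchmark's exact formula may differ.
    """
    headroom = 100.0 - baseline
    if headroom <= 0:
        raise ValueError("baseline is already 100%; normalized gain is undefined")
    return 100.0 * (with_skills - baseline) / headroom


# Aggregate pass rates quoted above: 33.9% without skills, 50.5% with skills.
lift = absolute_lift(33.9, 50.5)
gain = normalized_gain(33.9, 50.5)

print(f"lift: {lift:.1f} points")       # lift: 16.6 points
print(f"normalized gain: {gain:.1f}%")  # normalized gain: 25.1%
```

Reporting both numbers is useful because the same point lift means more when the baseline is already high: a configuration starting at 80% has far less headroom than one starting at 30%.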
This matters because developers have been adding skills to agents without a clear way to gauge their impact. The benchmark offers a paired-evaluation protocol that quantifies the benefit by comparing the same task with and without skills, which encourages more disciplined skill design. It also suggests that targeted skill sets can offset the limits of a smaller model, a potential cost saver for enterprises.
In short, SkillsBench indicates that well-chosen skill modules are not a soft add-on but a measurable lever, and future LLM agents may be judged as much by the efficiency of their skill libraries as by raw model size.