A research team has released ScholarQuest, a benchmark designed to measure how well AI agents actually find academic papers in realistic conditions.
ScholarQuest draws on over 1,000 computer science topics and four types of research queries: method-oriented, setting-anchored, comparison-based, and scope-controlled. The benchmark includes a shared retrieval backend called ScholarBase to make results reproducible across different systems. Crucially, it tests agents in open literature environments — not curated, closed datasets that flatter the models being evaluated.
The results are a reality check. The best-performing agent reached 0.314 Recall@100 and 0.355 Recall@All, meaning it missed roughly two-thirds of the relevant papers, both within its top 100 results and across its full result set. Agentic methods (iterative, intent-driven search loops) did outperform single-shot retrieval, which at least validates the architectural direction even if the scores are modest.
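To make the metric concrete, here is a minimal sketch of how Recall@k is conventionally computed. It assumes the standard definition (the share of relevant papers that appear in the top k results). ScholarQuest's exact scoring may differ. The function name `recall_at_k` and the paper IDs are hypothetical toy values, not benchmark data.

```python
def recall_at_k(ranked_ids, relevant_ids, k=None):
    """Fraction of relevant papers found in the top-k results.

    With k=None the whole result list is used (a "Recall@All"-style score).
    Assumes the conventional definition, not necessarily the benchmark's own.
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top = ranked_ids if k is None else ranked_ids[:k]
    return len(relevant & set(top)) / len(relevant)


# Toy example: three relevant papers, five returned results.
ranked = ["p7", "p2", "p9", "p4", "p1"]
relevant = {"p2", "p1", "p5"}

print(recall_at_k(ranked, relevant, k=2))  # one of three relevant papers in the top 2
print(recall_at_k(ranked, relevant))       # two of three relevant papers found overall
```

Under this definition, a Recall@100 of 0.314 means that about 31 of every 100 relevant papers showed up in the agent's top 100 results, and the remaining 69 or so did not.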
For context, the AI research community has leaned heavily on agent-based pipelines as a fix for shallow retrieval, but this benchmark suggests the plumbing still leaks. By this measure, anyone selling an AI-powered literature review tool today is likely working with systems that miss most of the relevant papers.