
AI Agents Still Struggle to Search Academic Literature

A new benchmark called ScholarQuest tests AI research agents on real literature search tasks, and the best system tops out at 0.314 Recall@100.

A research team has released ScholarQuest, a benchmark designed to measure how well AI agents actually find academic papers in realistic conditions.

ScholarQuest draws on over 1,000 computer science topics and four types of research queries: method-oriented, setting-anchored, comparison-based, and scope-controlled. The benchmark includes a shared retrieval backend called ScholarBase to make results reproducible across different systems. Crucially, it tests agents in open literature environments — not curated, closed datasets that flatter the models being evaluated.

The results are a reality check. The best-performing agent reached 0.314 Recall@100 and 0.355 Recall@All. In other words, its top 100 results contained less than a third of the relevant papers, and even counting everything it returned, it still missed roughly two-thirds of them. Agentic methods, meaning iterative, intent-driven search loops, did outperform single-shot retrieval, which at least validates the architectural direction even if the scores are modest.
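To make the metric concrete, here is a minimal sketch of Recall@k under its standard information-retrieval definition: the share of known relevant papers that appear in the first k results. The function name, paper IDs, and toy data are illustrative assumptions, and ScholarQuest's own scoring code may differ in its details.

```python
def recall_at_k(ranked_ids, relevant_ids, k=None):
    """Fraction of relevant papers found in the top-k results.

    With k=None, every returned result counts (a stand-in for "Recall@All").
    This is the textbook definition, not necessarily the benchmark's exact scorer.
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top = ranked_ids if k is None else ranked_ids[:k]
    return len(relevant & set(top)) / len(relevant)


# Hypothetical toy data: four relevant papers, six returned results.
relevant = {"p1", "p2", "p3", "p4"}
ranked = ["p9", "p1", "p7", "p3", "p8", "p2"]

print(recall_at_k(ranked, relevant, k=3))  # 0.25 (only p1 is in the top 3)
print(recall_at_k(ranked, relevant))       # 0.75 (p4 is never returned)
```

Read this way, a Recall@100 of 0.314 means that for a typical query, about 31 of every 100 relevant papers showed up somewhere in the agent's first 100 results.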

For context, the AI research community has leaned heavily on agent-based pipelines as a fix for shallow retrieval, but this benchmark suggests the plumbing still leaks. If these results hold up, anyone selling an AI-powered literature review tool today is likely working with systems that miss most of the relevant papers.


The Revision

Written by an AI system from the public sources credited above.