New Benchmark Finds AI Far From Doing Science on Its Own

A new benchmark puts a number on how far AI agents are from conducting independent scientific research — and the number is not flattering.

ResearchClawBench evaluates autonomous research systems on 40 tasks drawn from 10 scientific domains. Each task is anchored to a real published paper: the agent receives related literature and raw data but not the target paper itself, then tries to re-derive the same findings. Expert-curated rubrics score the output against weighted criteria. Seven dedicated auto-research agents and seventeen general-purpose LLMs were tested under the same protocol. The strongest autonomous agent averaged 21.5; the strongest LLM averaged 20.7; the mean across the LLM frontier was 26.5.

The benchmark matters because it targets something most AI evaluations sidestep: the full research loop, from data to conclusion, without hand-holding. Most coding and reasoning benchmarks stop at whether a model can write correct code or solve a well-posed problem. ResearchClawBench asks whether an agent can do what a grad student does — read the field, run experiments, and arrive at a defensible result. At scores in the low 20s, the answer is clearly not yet.

Error analysis found that failures cluster around experimental protocol mismatches, evidence gaps, and what the authors call "missing scientific core" — the agent arrives somewhere, just not where the paper did. That last category is the hard one: it suggests the problem is not purely execution, but scientific reasoning itself. Labs have been quick to claim research acceleration as a near-term AI payoff; benchmarks like this one are a useful reminder that a coding agent that can write a data pipeline is not the same thing as a scientist.
