AI Peer Review Systems Benchmarked Against Real Papers

AI-assisted peer review is being tested against the very problem it was built to solve.

Researchers benchmarked three agentic review systems, two open-source (OpenAIReview and coarse) and one proprietary (Reviewer3), plus a zero-shot baseline, running each across six large language models ranging from frontier to efficient tiers. On ICLR and NeurIPS papers, every system beat chance at predicting paper quality, with citations and acceptance decisions serving as proxies for quality. The top performer, OpenAIReview paired with GPT-5.5, hit 83.0% pairwise accuracy. For error detection, the team built a perturbation benchmark that injects four categories of mistakes into papers across eight arXiv subject classes; that same top configuration caught 71.6% of the injected errors. Pooling detections across all six models lifted recall to 83.3%, suggesting that the models find partly different errors rather than all catching the same ones.
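For readers unfamiliar with the metric, pairwise accuracy can be sketched as follows. This is a minimal illustration of the usual definition, not the benchmark's own code: the function name, the half-credit rule for tied predictions, and the toy scores and citation counts are all assumptions made for the example.

```python
from itertools import combinations


def pairwise_accuracy(predicted, reference):
    """Share of paper pairs whose predicted ordering matches the reference ordering.

    Pairs with equal reference values are skipped because they carry no
    ground-truth preference. Tied predictions get half credit (an assumption;
    the article does not say how ties are handled).
    """
    correct = 0.0
    total = 0
    for i, j in combinations(range(len(predicted)), 2):
        if reference[i] == reference[j]:
            continue  # no preferred paper in this pair
        total += 1
        pred_diff = predicted[i] - predicted[j]
        ref_diff = reference[i] - reference[j]
        if pred_diff == 0:
            correct += 0.5  # assumed tie handling
        elif (pred_diff > 0) == (ref_diff > 0):
            correct += 1.0
    if total == 0:
        raise ValueError("no comparable pairs")
    return correct / total


# Hypothetical data: one review score and one citation count per paper.
review_scores = [7.5, 4.0, 6.0, 3.0]
citations = [120, 15, 40, 20]

accuracy = pairwise_accuracy(review_scores, citations)
print(f"pairwise accuracy: {accuracy:.3f}")
```

Under this definition a system that scores papers at random lands near 50%, which is the chance level the benchmarked systems are compared against.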

That ensemble finding is the most practically interesting result here. It implies that no single model configuration is close to ceiling, and that better harness design — not just bigger models — could push detection meaningfully higher. For academic publishers under pressure from AI-generated submissions, that is a more actionable signal than raw accuracy numbers.
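The pooling idea can be illustrated with a small sketch: treat each model's detections as a set and take the union before computing recall. The model names, error identifiers, and helper functions below are hypothetical, and the article does not describe how the study matched detections to injected errors.

```python
def recall(detected, injected):
    """Fraction of injected errors that appear in the detected set."""
    if not injected:
        raise ValueError("no injected errors to score against")
    return len(detected & injected) / len(injected)


def pooled_recall(detections_by_model, injected):
    """Recall of the union of all models' detections."""
    pooled = set().union(*detections_by_model.values())
    return recall(pooled, injected)


# Hypothetical data: six injected errors and three models' flagged items.
injected_errors = {"e1", "e2", "e3", "e4", "e5", "e6"}
detections = {
    "model_a": {"e1", "e2", "e3", "e4"},
    "model_b": {"e1", "e2", "e5"},
    "model_c": {"e2", "e3", "x9"},  # "x9" is a false positive
}

per_model = {name: recall(found, injected_errors) for name, found in detections.items()}
best_single = max(per_model.values())
pooled = pooled_recall(detections, injected_errors)
print(f"best single model: {best_single:.3f}, pooled: {pooled:.3f}")
```

Pooled recall can only match or exceed the best single model, and it rises only when models catch different errors. The union also accumulates every model's false positives, so a practical ensemble would likely need a filtering or voting step on top.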

A real-user deployment of OpenAIReview logged a positive-to-negative vote ratio of 1.44 to 1, or roughly 59% of votes positive, with false positives and minor nitpicks as the top complaints. That is a familiar trade-off for anyone who has shipped an automated linter. The field is still early, but the gap between the best single configuration's 71.6% recall and 100% is not a rounding error.
