Today's large language models average barely above 50% on a new reasoning benchmark designed to test whether AI can actually think about rules, not just apply them.
Researchers have released HOLMES, a benchmark of 1,379 problems drawn from law and finance that require higher-order logical reasoning — meaning the model must reason about predicates, functions, and constraints themselves, not just draw conclusions from fixed facts. Current benchmarks lean heavily on first-order logic, where a model matches objects to predicates. HOLMES asks models to reason one level up: about the rules governing those predicates. Across tested models, average accuracy landed at 50.64%, with the strongest performer reaching 59.54%. The dataset and code are publicly available on GitHub.
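To make the first-order versus higher-order distinction concrete, here is a minimal Python sketch. It is not taken from the HOLMES dataset or code. The income rule, the threshold, and every name in it (`is_accredited`, `first_order_query`, `is_monotone_in_income`) are hypothetical, chosen only for illustration. The first function applies a fixed predicate to one object. The second takes a predicate as its input and asks a question about the rule itself.

```python
from typing import Callable, Iterable

# A predicate maps an object (here, a dict describing an investor) to True/False.
Predicate = Callable[[dict], bool]


def is_accredited(investor: dict) -> bool:
    """Hypothetical toy rule: qualify if income meets an assumed threshold."""
    return investor["income"] >= 200_000


def first_order_query(investor: dict) -> bool:
    """First-order question: does THIS investor satisfy the fixed predicate?"""
    return is_accredited(investor)


def is_monotone_in_income(rule: Predicate, incomes: Iterable[int]) -> bool:
    """Higher-order question: a property of the rule itself.

    Checks, over the sampled incomes only, that once the rule accepts an
    income it also accepts every higher sampled income. This is a finite
    spot check, not a proof over all incomes.
    """
    results = [rule({"income": x}) for x in sorted(incomes)]
    # Monotone means an accepted income is never followed by a rejected one.
    return all(not (lower and not higher)
               for lower, higher in zip(results, results[1:]))


if __name__ == "__main__":
    print(first_order_query({"income": 250_000}))
    print(is_monotone_in_income(is_accredited, [0, 100_000, 200_000, 300_000]))
```

The design point is that `is_monotone_in_income` never asks about a particular investor. Its argument is the rule, which is the kind of "one level up" question the benchmark is described as targeting.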
The gap between first-order and higher-order performance matters because real-world legal and financial reasoning constantly requires models to adjudicate between competing rules, handle scope conditions, and compose constraints — exactly the tasks where HOLMES shows sharp drops. The researchers also flag a subtler problem: high final-answer accuracy can hide shortcut reasoning, where a model lands on the right answer for the wrong reasons, particularly in conflict-resolution scenarios.
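The way final-answer accuracy can mask shortcut reasoning can be shown with a small Python sketch. This is an assumption-laden illustration, not the researchers' evaluation code. The `Graded` record, its `reasoning_valid` label, and the four sample items are invented for the example and are not HOLMES results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Graded:
    """One graded model response (hypothetical schema)."""
    answer_correct: bool   # final answer matches the gold label
    reasoning_valid: bool  # assumed label from a separate check of the reasoning


def final_answer_accuracy(items: Sequence[Graded]) -> float:
    """Share of items with a correct final answer, however it was reached."""
    return sum(g.answer_correct for g in items) / len(items)


def sound_accuracy(items: Sequence[Graded]) -> float:
    """Share of items with a correct answer AND valid reasoning."""
    return sum(g.answer_correct and g.reasoning_valid for g in items) / len(items)


# Invented toy data: the second item is a right answer reached by a shortcut.
sample = [
    Graded(answer_correct=True, reasoning_valid=True),
    Graded(answer_correct=True, reasoning_valid=False),
    Graded(answer_correct=False, reasoning_valid=False),
    Graded(answer_correct=True, reasoning_valid=True),
]

if __name__ == "__main__":
    print(final_answer_accuracy(sample), sound_accuracy(sample))
```

On this toy sample the two metrics diverge, which is the pattern the researchers warn about: the headline number counts the shortcut answer as a success, while the stricter metric does not.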
For context, leading models routinely clear 90% on older logic benchmarks like LogiQA 2.0, which has fed a narrative of steady reasoning progress. HOLMES suggests that narrative was partly an artifact of easier tests. It is a familiar pattern in AI benchmarking: scores look strong until someone builds a harder test.
