A new reasoning benchmark exposes a wide gap between what AI labs claim their models can do and what those models actually deliver under formal scrutiny.
Researchers released DeFAb, a benchmark built from four decades of publicly funded knowledge bases — including OpenCyc, YAGO, Wikidata, ConceptNet, and UMLS — covering 372,648 instances derived from 33.75 million materialized rules. The task is defeasible abduction: given an anomaly, construct a hypothesis that explains it by overriding one assumption without breaking everything else. Every correct answer is verifiable in polynomial time, so there is no ambiguity about what counts as right. A symbolic rule engine clears the benchmark in under 50 microseconds at 100% accuracy. The best frontier language model peaks at 65% — and falls to 23.5% when tested across four different surface renderings of the same problem, a method designed to filter out pattern-matching that isn't real reasoning.
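To make the task concrete, here is a toy sketch of what "overriding one assumption without breaking everything else" can look like. Everything in it is hypothetical: the bird example, the `facts` and `observed` tables, and the `flies` and `explains` helpers are illustrative inventions, not DeFAb's actual schema or rules. The point is only that checking a candidate hypothesis is a cheap, mechanical scan.

```python
# Toy illustration only: names and data are hypothetical, not DeFAb's format.

# What we know about each individual.
facts = {
    "tweety": {"bird"},
    "robin": {"bird"},
    "opus": {"bird", "penguin"},
}

# What was actually observed. Opus not flying is the anomaly.
observed = {"tweety": True, "robin": True, "opus": False}


def flies(animal, exceptions):
    """Default rule: birds fly, unless one of their classes is an exception."""
    kinds = facts[animal]
    if "bird" not in kinds:
        return False
    return not any(kind in exceptions for kind in kinds)


def explains(exceptions):
    """A hypothesis (a set of exception classes) is accepted only if the
    revised rule reproduces every observation, not just the anomaly."""
    return all(flies(a, exceptions) == observed[a] for a in facts)


print(explains(set()))        # False: the default rule cannot explain opus
print(explains({"penguin"}))  # True: overrides the default for penguins only
print(explains({"bird"}))     # False: explains opus but breaks tweety and robin
```

The third call is the interesting one: a hypothesis that is too broad accounts for the anomaly but wrecks conclusions that were already correct, so the check rejects it.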
The surface-rendering test is the sharpest finding here. When the same logical problem is rephrased four ways and the worst score is taken, the best model's accuracy drops from 65% to 23.5%, a collapse of roughly 40 percentage points and a sign that models are often recognizing problem formats rather than solving the underlying logic. Chain-of-thought prompting, the technique AI labs frequently cite as evidence of deeper reasoning, varied by about 36 percentage points, more than any gap between models. That suggests the prompting method is noisy, not a reliable signal of capability.
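The difference between average and worst-of-four scoring is easy to see in a few lines of code. The sketch below assumes one plausible reading of "the worst score is taken", namely that an instance counts as solved only if the model gets it right under every rendering; the benchmark may aggregate differently. The `results` table is made up for illustration and is not DeFAb data.

```python
def mean_accuracy(results):
    """Fraction of all (instance, rendering) pairs answered correctly."""
    flat = [ok for per_instance in results for ok in per_instance]
    return sum(flat) / len(flat)


def worst_case_accuracy(results):
    """Fraction of instances answered correctly under every rendering.
    Assumed aggregation: one miss on any rendering fails the instance."""
    return sum(all(per_instance) for per_instance in results) / len(results)


# Made-up outcomes: 4 instances, each posed in 4 renderings (True = correct).
results = [
    [True, True, True, True],
    [True, False, True, True],
    [True, True, False, True],
    [False, False, False, False],
]

print(mean_accuracy(results))        # 0.625
print(worst_case_accuracy(results))  # 0.25
```

In this invented table the model answers most individual prompts correctly, yet only one instance in four survives rephrasing. That is the kind of gap the worst-case metric is meant to surface.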
This lands at an awkward moment: reasoning has become the marquee marketing claim for nearly every major model release in 2025 and 2026. DeFAb's verifiable gold standards make it harder to paper over failures with fluent-sounding output. The researchers also released a harder 235-instance variant, on which the best model scores 53.3%, and a mathematical creativity variant, on which models produced zero novel concepts.