A new benchmark tests table questions that have no answer to look up, and current models mostly miss the point.
The benchmark, called TopBench, holds 779 samples split across four tasks: single-point prediction, decision-making, treatment-effect analysis, and complex filtering. Each one asks a model to infer an unobserved value from historical patterns rather than retrieve a number that already sits in the table. Answers can include both written reasoning and structured tables, and the authors evaluated models in plain-text and agentic setups alike. The recurring failure was not bad arithmetic. It was that models often did not recognize they were being asked to predict at all, and fell back on a lookup.
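To make the lookup-versus-prediction distinction concrete, here is a minimal sketch in Python. It is not TopBench's data, code, or scoring method: the `sales` table, the function names, and the choice of a least-squares line as the forecasting method are all assumptions made for illustration.

```python
# Hypothetical toy table (not from TopBench): yearly sales with no row for 2024.
sales = {2020: 100.0, 2021: 110.0, 2022: 120.0, 2023: 130.0}


def lookup(table, year):
    """Retrieval: return the stored cell, or None if the table has no such row."""
    return table.get(year)


def predict_linear(table, year):
    """Prediction: fit an ordinary least-squares line to the history and extrapolate.

    A straight line is only an illustrative stand-in for whatever model
    a real system would use.
    """
    xs = list(table.keys())
    ys = list(table.values())
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * year


def answer(table, year):
    """Use the stored value when it exists; otherwise recognize a forecast is needed."""
    found = lookup(table, year)
    if found is not None:
        return ("lookup", found)
    return ("prediction", predict_linear(table, year))


if __name__ == "__main__":
    print(answer(sales, 2022))  # the value is in the table
    print(answer(sales, 2024))  # the value must be inferred from the trend
```

In this toy case a question about 2024 has no cell to retrieve, so the only sensible route is the second branch. The failure the article describes corresponds to a system that never reaches that branch and instead returns a nearby stored value, such as the 2023 figure.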
That gap matters because most table question-answering benchmarks reward exactly the wrong reflex: pull a cell, maybe sum a column, move on. Real questions frequently ask what happens next, and a model that quietly answers a forecast with a retrieval looks confident while being wrong in a way that is hard to catch. The authors find that reading the intent correctly is the precondition for any of the harder reasoning that follows.
Even agentic workflows, the currently fashionable fix, did not close the gap on their own. Better scores, the paper notes, will take stronger modeling and reasoning, not just more tool calls.