AI Math Benchmarks Miss Problems Models Can Actually Solve

Standard AI math benchmarks may be mislabeling problems as too hard when the real issue is how researchers sample model outputs.

A paper published on arXiv tested eight math benchmark configurations (the GSM8K and MATH datasets across four open-weight models) and found that 10.3 to 22.9 percent of the problems that went unsolved in six standard attempts were actually solvable at the same compute budget. The trick: instead of resampling randomly, the researchers applied a technique called activation grafting, which makes small, targeted edits to a model's internal representations. Greedy decoding alone solved at most 6 percent of those supposedly impossible problems; the five perturbed runs added via activation grafting accounted for the rest of the recovered problems. The effect grew as the compute budget increased.
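To make the bookkeeping behind a figure like "10.3 to 22.9 percent" concrete, here is a minimal sketch of how such a recovered share could be tallied. It is not the paper's code: the function name, the data layout, and the toy results are all illustrative assumptions, and the activation-grafting step itself is not modeled, only its pass/fail outcomes.

```python
def recovered_share(sampled, perturbed):
    """Fraction of problems unsolved under standard sampling that a second
    procedure solves at the same budget.

    sampled:   problem id -> list of pass/fail results for the standard attempts
    perturbed: problem id -> list of pass/fail results for the alternative
               procedure (assumed here: one greedy decode plus five perturbed runs)
    """
    # Problems where every standard attempt failed.
    unsolved = [pid for pid, results in sampled.items() if not any(results)]
    if not unsolved:
        return 0.0
    # Of those, the ones the alternative procedure solved at least once.
    recovered = [pid for pid in unsolved if any(perturbed.get(pid, []))]
    return len(recovered) / len(unsolved)


# Toy, made-up results for five hypothetical problems (six runs each).
sampled = {
    "p1": [False] * 6,
    "p2": [False] * 5 + [True],   # solved by standard sampling, so excluded
    "p3": [False] * 6,
    "p4": [False] * 6,
    "p5": [False] * 6,
}
perturbed = {
    "p1": [False, False, True, False, False, False],  # recovered
    "p3": [False] * 6,
    "p4": [False] * 6,
    "p5": [False] * 6,
}

share = recovered_share(sampled, perturbed)
```

In this toy data four problems go unsolved under standard sampling and one of them is recovered, so the share is one in four. Both procedures use six runs per problem, which is what "the same compute budget" means in this sketch.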

This matters because the benchmark metric at the center of this, pass@k (the share of problems for which at least one of k sampled solution chains reaches the correct answer), is not just a scoreboard number. It drives how labs curate training data, design reinforcement learning curricula, and train the verifiers used to judge model outputs. If that signal systematically mislabels a chunk of hard problems as unreachable, every downstream system trained on those labels inherits the error.
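For readers who want the metric spelled out, pass@k is commonly estimated per problem from n samples, c of which are correct, as 1 - C(n-c, k) / C(n, k), then averaged over the benchmark. The sketch below implements that widely used estimator; whether the paper uses this exact form is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given n total samples of which c were correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k includes a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark.

    correct_counts holds one entry per problem: how many of its n samples
    were correct.
    """
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

A problem with zero correct samples out of six gets pass@6 = 0 under this estimator, which is exactly the label the article says can be misleading: it records that six samples failed, not that the problem is beyond the model.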

The finding also complicates the current AI scaling narrative. Much of the case for throwing more compute at reasoning models rests on pass@k curves: more samples, better coverage. This research suggests the curve has a structural blind spot, and that the hardest stratum of problems is not simply out of reach but can be identified from the model's internal state.
