Mask-Proof turns existing mathematical proofs into masked-step questions that an LLM must fill in.
The authors took proofs from real research, hid critical formula steps, and kept the surrounding context. An LLM-based equivalence judge scored each reconstruction, using repeated votes for stability. The resulting benchmark, Mask-ProofBench, contains 292 problems from a range of fields. In tests on 17 language models, reasoning-enhanced variants beat their standard counterparts by 12-27%, and the judge agreed with expert annotators 96.8% of the time.
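The repeated-vote idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default of five votes, and the strict-majority rule are assumptions made for illustration, and `toy_judge` is a deterministic stand-in for what would in practice be a call to an LLM judge.

```python
from typing import Callable, List


def majority_vote(verdicts: List[bool]) -> bool:
    """Return True only if strictly more than half of the verdicts are True."""
    return sum(verdicts) * 2 > len(verdicts)


def judge_with_votes(
    judge: Callable[[str, str], bool],
    reference: str,
    candidate: str,
    n_votes: int = 5,  # assumed value; the summary does not state the vote count
) -> bool:
    """Query the judge n_votes times and aggregate the verdicts by majority."""
    verdicts = [judge(reference, candidate) for _ in range(n_votes)]
    return majority_vote(verdicts)


def toy_judge(reference: str, candidate: str) -> bool:
    """Hypothetical stand-in judge: equal after removing spaces.

    A real equivalence judge would be a stochastic LLM call, which is
    why repeating the query and voting helps stabilize the score.
    """
    return reference.replace(" ", "") == candidate.replace(" ", "")


if __name__ == "__main__":
    print(judge_with_votes(toy_judge, "x + y = z", "x+y=z"))
    print(judge_with_votes(toy_judge, "x + y = z", "x - y = z"))
```

With a stochastic judge, an odd `n_votes` avoids ties; under the strict-majority rule used here, a tie counts as a rejection.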
This matters because most current math-oriented benchmarks either score only final answers or require costly human grading. By checking intermediate steps automatically, researchers can compare models on proof-level reasoning at scale. The high agreement with experts also suggests the metric is reliable enough for iterative model development.
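For reference, an agreement figure of this kind is typically a simple match rate between judge verdicts and expert labels on the same items. The sketch below assumes binary correct/incorrect labels and uses made-up data; the summary does not say exactly how the reported agreement was computed.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[bool], expert_labels: Sequence[bool]) -> float:
    """Fraction of items on which the judge and the expert give the same label."""
    if len(judge_labels) != len(expert_labels):
        raise ValueError("label sequences must have the same length")
    if not judge_labels:
        raise ValueError("label sequences must not be empty")
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(judge_labels)


if __name__ == "__main__":
    # Illustrative labels only, not data from the benchmark.
    judge = [True, False, True, True]
    expert = [True, False, False, True]
    print(f"{agreement_rate(judge, expert):.1%}")
```

Raw agreement does not correct for chance, so a chance-adjusted statistic such as Cohen's kappa is a common complement when the label distribution is skewed.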
If the community adopts this approach, we may see faster progress on AI‑assisted theorem proving, but the pipeline still depends on a handcrafted masking stage and a single judge model, so broader validation will be needed.