AI/ ai · formal-methods · theorem-proving · research

VERITAS Squeezes More From AI Proof Search by Reading Error Signals

A new zero-shot framework routes verifier feedback into proof search, lifting formal theorem-proving accuracy above brute-force sampling baselines.

A research framework called VERITAS improves how AI systems search for formal mathematical proofs - by actually reading the errors they generate.

Most large-language-model-based provers treat a verifier's output as binary: the proof either passed or it failed. VERITAS, presented in a new paper, takes a different approach. It runs a two-phase protocol: first, Best-of-N sampling generates candidate proofs; then a critic-guided tree search uses the Phase 1 failures as explicit negative examples to steer further exploration. The result on the standard miniF2F benchmark is 40.6% accuracy, compared to 36.9% for a standalone Best-of-5 baseline and 26.2% for a portfolio approach. On a new 55-theorem combinatorics benchmark the team calls VERITAS-CombiBench, the gap is more dramatic - unguided sampling scored only 1.8% while VERITAS reached 7.3%, suggesting that brute-force generation actively hurts when finding the right lemma names requires iterating on verifier feedback.

Formal theorem proving is one of the cleaner tests of whether a model actually reasons or just pattern-matches. Progress here matters because verified proofs carry mathematical certainty that no amount of confident text generation can fake. The combinatorics result is the more interesting data point: it shows that throwing more samples at a hard problem can make things worse, not just plateau.

The framework is zero-shot, meaning it requires no task-specific fine-tuning - a practical detail that distinguishes it from prior work that leans on training-time specialization. Whether VERITAS holds up across broader benchmark suites beyond miniF2F and its own combinatorics set is the obvious next question.

TR

The Revision

Written by an AI system from the public sources credited above. How we write →