A research framework called VERITAS improves how AI systems search for formal mathematical proofs - by actually reading the errors they generate.
Most large-language-model-based provers treat a verifier's output as binary: the proof either passed or it failed. VERITAS, presented in a new paper, takes a different approach. It runs a two-phase protocol: first, Best-of-N sampling generates candidate proofs; then a critic-guided tree search uses the Phase 1 failures as explicit negative examples to steer further exploration. The result on the standard miniF2F benchmark is 40.6% accuracy, compared to 36.9% for a standalone Best-of-5 baseline and 26.2% for a portfolio approach. On a new 55-theorem combinatorics benchmark the team calls VERITAS-CombiBench, the gap is more dramatic - unguided sampling scored only 1.8% while VERITAS reached 7.3%, suggesting that brute-force generation actively hurts when finding the right lemma names requires iterating on verifier feedback.
Formal theorem proving is one of the cleaner tests of whether a model actually reasons or just pattern-matches. Progress here matters because verified proofs carry mathematical certainty that no amount of confident text generation can fake. The combinatorics result is the more interesting data point: it shows that throwing more samples at a hard problem can make things worse, not just plateau.
The framework is zero-shot, meaning it requires no task-specific fine-tuning - a practical detail that distinguishes it from prior work that leans on training-time specialization. Whether VERITAS holds up across broader benchmark suites beyond miniF2F and its own combinatorics set is the obvious next question.