A new benchmark called SorryDB now streams real Lean tasks from 78 open‑source formalisation projects.
The dataset updates continuously, replacing static collections that usually consist of competition puzzles. Researchers tested several AI approaches—general‑purpose large language models, an agentic system built on Gemini Flash, and dedicated symbolic provers—on a 1,000‑task snapshot. The Gemini‑based agent topped the list, but it was not decisively better than the other models or existing Lean tactics.
Because the benchmark evolves with the community, it forces provers to handle fresh dependencies and avoids the “train‑on‑test” problem that plagues static suites. In practice, tools that score well on SorryDB should be more useful to mathematicians working on ongoing projects, not just on curated benchmarks.
If the AI‑proving field keeps relying on static test beds, progress will stay confined to narrow tricks. SorryDB pushes developers toward generalisable reasoning, though the current results show no single approach dominates yet.