A research system that combines symbolic logic with a formal geometry language has outperformed Gemini 2.5 Pro and GPT-5 on three-dimensional geometry benchmarks.
Hilbert-Geo translates geometry problems and diagrams into a structured formal language, then uses a pre-built theorem bank to generate step-by-step solutions. The approach splits the work into two stages: a parsing step that converts natural language and visual diagrams into a conditional description language, and a reasoning step that applies formal rules and algebra to produce verifiable answers. On a newly released expert-annotated solid geometry dataset, it reached 77.3% accuracy — compared to 54.2% for Gemini 2.5 Pro on the same benchmark. On a separate benchmark, it scored 84.1% against GPT-5's 62.9%.
Most AI math reasoning work has concentrated on two-dimensional geometry, where large models already perform reasonably well. Solid geometry is harder: it requires interpreting 3D diagrams and chaining spatial inferences across more complex structures — exactly the kind of task where statistical pattern-matching runs out of headroom before symbolic methods do. By grounding solutions in a verifiable theorem bank rather than learned intuition, the system produces answers that can be checked step by step, which matters when math AI is being pitched for education or any context where showing your work is not optional.
A 23-point gap over Gemini 2.5 Pro on solid geometry is the kind of result worth stress-testing — the team released their code and datasets publicly, so at least others can try.