A research framework called SIGMA is outperforming both open- and closed-source AI systems on hard math benchmarks by spreading reasoning across multiple agents instead of relying on one.
Current retrieval-augmented models typically pull in outside knowledge from a single angle and follow rigid search strategies, a design that breaks down on problems requiring synthesis across sources. SIGMA takes a different approach: it spins up specialized agents that each reason independently and generate hypothetical passages to sharpen their retrieval, then feed their findings to a moderator that reconciles the results. Tested on MATH500, AIME, and the PhD-level science benchmark GPQA, the framework posted a 7.4 percentage-point absolute improvement over prior systems.
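To make that architecture concrete, here is a minimal Python sketch of the general pattern: several agents, each retrieving from its own angle via a hypothetical passage, with a moderator merging what they find. It is not SIGMA's implementation, which has not been released. The corpus, the `Agent` class, the word-overlap retriever, the vote-counting moderator, and every name below are assumptions made for illustration, and a fixed template stands in for the language model that would draft each hypothetical passage.

```python
from collections import Counter
from dataclasses import dataclass

# Toy corpus standing in for an external knowledge source (hypothetical data).
CORPUS = {
    "d1": "the quadratic formula gives roots of a quadratic equation",
    "d2": "vieta formulas relate the sum and product of roots to coefficients",
    "d3": "the discriminant decides whether roots are real or complex",
    "d4": "modular arithmetic studies remainders after division",
}


def tokens(text: str) -> set[str]:
    """Split text into a set of lowercase words."""
    return set(text.lower().split())


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query; ties break on document id.

    This is a stand-in for a real retriever, not a claim about SIGMA's.
    """
    ranked = sorted(
        CORPUS,
        key=lambda doc_id: (-len(tokens(query) & tokens(CORPUS[doc_id])), doc_id),
    )
    return ranked[:k]


@dataclass
class Agent:
    name: str
    angle: str  # the perspective this agent reasons from (assumed design)

    def hypothetical_passage(self, question: str) -> str:
        # A real system would ask a language model to draft this passage;
        # here a fixed template plays that role.
        return f"{question} {self.angle}"

    def findings(self, question: str) -> list[str]:
        # Retrieve with the hypothetical passage instead of the bare question.
        return retrieve(self.hypothetical_passage(question))


def moderate(all_findings: list[list[str]]) -> list[str]:
    """Reconcile agent findings: count votes per document, most-cited first."""
    votes = Counter(doc_id for found in all_findings for doc_id in found)
    ordered = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered]


def run(question: str, agents: list[Agent]) -> list[str]:
    """Let each agent retrieve independently, then hand results to the moderator."""
    return moderate([agent.findings(question) for agent in agents])


AGENTS = [
    Agent("formula", "quadratic formula equation"),
    Agent("structure", "vieta sum product coefficients"),
    Agent("existence", "discriminant real complex"),
]

print(retrieve("find the roots"))     # single-angle retrieval
print(run("find the roots", AGENTS))  # multi-agent retrieval, moderated
```

In this toy setup the bare question surfaces only one document, while the three agents together surface three distinct ones for the moderator to merge. That is the synthesis-across-sources effect the paragraph above describes, shown here at a scale far too small to say anything about real benchmark performance.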
That margin matters because AIME and GPQA are not toy datasets: they are the benchmarks labs reach for precisely when they want to argue a model can do serious technical work. A 7.4-point gain spanning all three benchmarks is hard to explain away as cherry-picking. It also suggests that the bottleneck in math AI is less raw model size than how a system finds and combines relevant knowledge mid-reasoning.
The code is not yet public and is promised only "upon publication," so treat the numbers as unverified, pending peer review, until independent teams can replicate them.