Stacking more AI agents in a pipeline was supposed to make answers better — but past a certain depth, it stopped working.
Researchers have published ReM-MoA, a framework designed to fix a persistent failure mode in Mixture-of-Agents architectures. Standard MoA systems route outputs from one layer of AI agents into the next, but gains plateau or reverse as depth increases. ReM-MoA addresses this with two mechanisms: a Ranked Reasoning Memory that stores and scores reasoning traces from every layer using a dedicated Reviewer Agent, and a routing scheme that deliberately feeds different agents different mixes of successful and failed traces. A distillation pipeline using frontier models can further sharpen the Reviewer's ranking quality. Tested across five benchmarks covering math, formal logic, code, factual knowledge, and commonsense reasoning, ReM-MoA outperformed prior MoA variants — and its edge grew wider at greater depth.
The finding matters because inference-time scaling has become one of the main levers AI labs reach for when they want better performance without retraining. If depth scaling in multi-agent systems reliably degrades, that lever breaks early. ReM-MoA's memory layer essentially gives the pipeline a way to learn from its own mistakes mid-run rather than discarding them.
The catch: injecting a comparative Reviewer Agent and a curated routing scheme adds complexity and likely latency — costs the paper does not dwell on. Whether the benchmark gains survive contact with messier real-world tasks, or whether the overhead is worth it outside research settings, is the open question every "we outperform prior variants" paper leaves on the table.
