Researchers have found that model merging, a popular technique for combining AI models, breaks down precisely on the kind of model that is currently the most capable.
Reinforcement Learning with Verifiable Reward (RLVR), the post-training method behind many of today's stronger reasoning models, produces parameter updates that are sparse and scattered widely across the model's weight space. One might expect sparse updates to make merging easier, since two models that each change only a small set of weights should rarely collide. That intuition turns out to be wrong. RLVR models develop what the researchers call "near-orthogonal shortcuts": independent, non-overlapping paths through parameter space. Stacking two such models produces severe degradation rather than a useful combination. Supervised fine-tuning (SFT) models, by contrast, tend to converge toward shared, flat regions of weight space, which is why they merge relatively cleanly. The researchers trace the difference to the stochastic nature of reinforcement learning and the variety of reasoning strategies that emerge from it.
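A common way to reason about merging treats each fine-tuned model as a task vector: its weights minus the base model's weights. The sketch below is a minimal illustration, assuming plain Python lists stand in for flattened weight tensors. It measures how sparse an update is and how much two updates overlap. The helper names (`task_vector`, `sparsity`, `cosine`, `overlap`) and the toy weights are hypothetical, not the paper's analysis code.

```python
import math

def task_vector(base, tuned):
    """Per-parameter update: fine-tuned weight minus base weight."""
    return [t - b for b, t in zip(base, tuned)]

def sparsity(delta, tol=1e-8):
    """Fraction of parameters the update leaves (numerically) unchanged."""
    return sum(abs(d) <= tol for d in delta) / len(delta)

def cosine(u, v):
    """Cosine similarity; 0.0 means the two updates are orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def overlap(u, v, tol=1e-8):
    """Jaccard overlap of the coordinates each update actually changes."""
    changed_u = {i for i, a in enumerate(u) if abs(a) > tol}
    changed_v = {i for i, b in enumerate(v) if abs(b) > tol}
    union = changed_u | changed_v
    return len(changed_u & changed_v) / len(union) if union else 0.0

# Toy example (hypothetical numbers): two sparse updates touching disjoint weights.
base = [0.0] * 8
math_model = [0.5, 0.0, 0.0, 0.0, -0.3, 0.0, 0.0, 0.0]
code_model = [0.0, 0.0, 0.4, 0.0, 0.0, 0.0, 0.2, 0.0]

tau_math = task_vector(base, math_model)
tau_code = task_vector(base, code_model)
print(sparsity(tau_math), sparsity(tau_code))  # 0.75 0.75
print(cosine(tau_math, tau_code), overlap(tau_math, tau_code))  # 0.0 0.0
```

Disjoint supports give a cosine similarity of exactly zero, which is the situation the naive intuition expects to merge cleanly. The finding reported above is that, for RLVR models, near-orthogonality of this kind does not translate into a clean merge; a toy calculation like this one cannot by itself show why.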
This matters because model merging is one of the few training-free ways to aggregate capabilities from separately trained models. If it worked reliably for RLVR models, labs could combine, say, a math specialist and a code specialist without the cost of retraining from scratch. The failure mode identified here forecloses that shortcut, at least under standard merging recipes.
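For reference, one widely used training-free recipe is task arithmetic: add each model's task vector back onto the shared base, optionally scaled. The following is a minimal sketch, assuming all models were fine-tuned from the same base checkpoint with the same architecture and that weights are flattened into lists. The function name and the `scale` parameter are illustrative, not a specific library's API.

```python
def merge_task_arithmetic(base, tuned_models, scale=1.0):
    """Return base + scale * sum_i (tuned_i - base), computed per parameter."""
    merged = list(base)
    for tuned in tuned_models:
        for i, (b, t) in enumerate(zip(base, tuned)):
            merged[i] += scale * (t - b)
    return merged

# Hypothetical specialists fine-tuned from the same base weights.
base = [1.0, 1.0, 1.0, 1.0]
math_model = [1.5, 1.0, 1.0, 0.75]
code_model = [1.0, 1.25, 1.0, 1.0]

merged = merge_task_arithmetic(base, [math_model, code_model])
averaged = merge_task_arithmetic(base, [math_model, code_model], scale=0.5)
print(merged)    # [1.5, 1.25, 1.0, 0.75]
print(averaged)  # [1.25, 1.125, 1.0, 0.875]
```

With two models and `scale=0.5`, this reduces to plain weight averaging. Recipes of this kind add updates together without modeling how they interact, which is why they are cheap and also why they offer no protection when combined updates interfere.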
The paper proposes SAR-Merging, a method that uses Fisher Information to arbitrate conflicts where the models' updates overlap, then applies magnitude-aware sparsification to preserve fragile reasoning pathways. On math and coding benchmarks, SAR-Merging outperforms existing merging methods when applied to RLVR models. The paper does not fully answer whether it holds up outside controlled benchmarks, or whether computing Fisher Information at scale is cheap enough to be practical.
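The article describes SAR-Merging only at a high level, so the sketch below is not the paper's algorithm. It is a minimal illustration, under stated assumptions, of the two ingredients named: (1) on coordinates where more than one model changed a weight, average the competing updates weighted by a diagonal Fisher estimate, a swapped-in technique borrowed from Fisher-weighted merging; (2) trim the merged update by magnitude, keeping only its largest entries. The function names, `keep_ratio`, and the toy gradients are hypothetical. The `diagonal_fisher` helper also hints at one source of the overhead mentioned above: in this form it needs per-example gradients for every parameter of every model being merged.

```python
def diagonal_fisher(grad_samples):
    """Empirical diagonal Fisher: mean squared per-sample gradient for each parameter."""
    n = len(grad_samples)
    return [sum(g[i] ** 2 for g in grad_samples) / n for i in range(len(grad_samples[0]))]

def fisher_arbitrated_merge(base, deltas, fishers, keep_ratio=0.75, tol=1e-8):
    """Hypothetical sketch, not SAR-Merging itself: Fisher-weighted arbitration where
    updates overlap, then magnitude-based trimming of the merged update."""
    n_params = len(base)
    merged_delta = [0.0] * n_params
    for i in range(n_params):
        active = [k for k, d in enumerate(deltas) if abs(d[i]) > tol]
        if len(active) == 1:
            # Only one model changed this weight: keep its update as is.
            merged_delta[i] = deltas[active[0]][i]
        elif len(active) > 1:
            # Overlap: average the competing updates, weighted by each model's Fisher value.
            weights = [fishers[k][i] for k in active]
            total = sum(weights)
            if total > 0:
                merged_delta[i] = sum(w * deltas[k][i] for w, k in zip(weights, active)) / total
            else:
                merged_delta[i] = sum(deltas[k][i] for k in active) / len(active)
    # Magnitude-aware sparsification: keep the largest keep_ratio fraction of nonzero entries.
    nonzero = [i for i in range(n_params) if abs(merged_delta[i]) > tol]
    n_keep = max(1, round(keep_ratio * len(nonzero))) if nonzero else 0
    ranked = sorted(nonzero, key=lambda j: abs(merged_delta[j]), reverse=True)
    keep = set(ranked[:n_keep])
    return [b + (merged_delta[i] if i in keep else 0.0) for i, b in enumerate(base)]

# Toy inputs (hypothetical): task vectors and per-sample gradients for two models.
base = [0.0, 0.0, 0.0, 0.0]
delta_math = [0.5, 0.0, 0.25, 0.0]
delta_code = [0.0, -0.5, -0.25, 0.0625]
fisher_math = diagonal_fisher([[1.0, 1.0, 2.0, 1.0], [1.0, -1.0, -2.0, 1.0]])
fisher_code = diagonal_fisher([[1.0, 1.0, 1.0, 1.0], [-1.0, 1.0, 1.0, -1.0]])

merged = fisher_arbitrated_merge(base, [delta_math, delta_code], [fisher_math, fisher_code])
naive = [b + m + c for b, m, c in zip(base, delta_math, delta_code)]
print(merged)  # index 2 leans toward the math update; the smallest entry is trimmed
print(naive)   # plain summation cancels index 2 to zero
```

Where plain summation cancels the two overlapping updates to zero, Fisher weighting lets the model whose loss is more sensitive to that weight dominate. How closely this matches SAR-Merging's actual arbitration and sparsification rules is an assumption, not something the article states.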