S‑SPPO stabilises self‑play preference tuning for LLMs

S‑SPPO prevents instability in self‑play preference optimisation for LLMs.

The paper identifies a flaw in the recent Self‑Play Preference Optimisation (SPPO) loop: when the preference oracle is overly confident about semantically near‑identical answers, the model drifts toward degenerate policies. The authors introduce a two‑part calibration. First, a semantic gate lowers win‑rate targets as response overlap rises, nudging the policy toward a maximum‑entropy baseline. Second, a latent‑space repulsion term pushes the embeddings of chosen and rejected responses apart, avoiding manifold collapse. The theoretical analysis shows that both changes preserve the underlying constant‑sum game, so convergence to a Nash equilibrium is still guaranteed. Experiments with Llama‑3‑8B on AlpacaEval 2.0 report a 52.19% overall win rate and a 47.46% win rate when length is controlled. Both figures are higher than vanilla SPPO, and the gains come without any extra human‑annotated preferences.
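The two calibration steps can be illustrated with a small Python sketch. This is not the paper's formulation: the linear interpolation used for the gate, the hinge form of the repulsion penalty, the `margin` value, and all function names are assumptions chosen only to show the idea that similar responses should yield a win‑rate target near 0.5 and that close embeddings should be penalised.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gated_win_rate(raw_win_rate, similarity):
    """Hypothetical semantic gate (assumed linear form, not from the paper).

    Shrinks the oracle's win-rate target toward 0.5 (no preference)
    as the semantic similarity of the two responses approaches 1.
    """
    s = min(max(similarity, 0.0), 1.0)  # clamp similarity to [0, 1]
    return 0.5 + (1.0 - s) * (raw_win_rate - 0.5)


def repulsion_penalty(chosen_emb, rejected_emb, margin=0.5):
    """Hypothetical repulsion term (assumed hinge form, not from the paper).

    Positive only when the chosen and rejected embeddings are more
    similar than `margin`, so minimising it pushes them apart.
    """
    return max(0.0, cosine_similarity(chosen_emb, rejected_emb) - margin)
```

Under these assumptions, an oracle score of 0.9 is left unchanged for unrelated responses (`similarity = 0`) and is pulled all the way to 0.5 for identical ones (`similarity = 1`), while the penalty is zero for orthogonal embeddings and grows as they align.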

Why it matters: Preference alignment has become a bottleneck for scaling LLMs responsibly. If self‑play loops collapse, developers must fall back on costly human feedback. S‑SPPO offers a way to harvest synthetic preferences safely, narrowing the gap between cheap self‑play and expensive human‑in‑the‑loop methods. Its semantic gating also hints at a broader principle: aligning optimisation targets with the model’s own uncertainty helps keep training dynamics stable.

The result is a modest but tangible step toward fully autonomous preference tuning, echoing earlier attempts like RLHF‑lite that tried to reduce human labour. Whether the approach scales to larger models or more nuanced tasks remains to be seen, but it shows that the instability problem is not insurmountable.
