Sequential AI Alignment Is More Nuanced Than Commonly Assumed

Researchers find that training AI models on multiple behavioral goals in sequence does not necessarily erase what the model learned before. The outcome depends on the scenario.

A team studying sequential Direct Preference Optimization (DPO) - a technique labs use to nudge language models toward preferred behavior - ran Llama-3.1-8B-Instruct through four different training scenarios using lightweight LoRA adapters. They measured what happened to earlier-learned preferences after each subsequent training stage. The result: no single pattern. Depending on the scenario, earlier training survived, degraded, benefited from positive transfer, or was simply redistributed across specific example pairs.

The finding matters because most AI alignment pipelines are built on an implicit assumption: train on new objectives and you probably hurt old ones. If that assumption is wrong - or only partially right - teams building safety or quality layers into large language models have more room to maneuver than they thought. The study also flags that aggregate accuracy metrics can hide what is happening at the level of individual examples, a methodological blind spot the field has largely ignored.
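To make the point about aggregate metrics concrete, here is a minimal sketch in Python. It is not the study's evaluation code: the function name `pairwise_report` and the six-pair toy data are hypothetical, and it assumes each preference pair is scored simply as correct (the model prefers the chosen response) or incorrect.

```python
def pairwise_report(before, after):
    """Compare per-pair correctness before and after a later training stage.

    `before` and `after` are equal-length lists of booleans. Entry i is True
    when the model preferred the chosen response for preference pair i.
    (Hypothetical helper for illustration only.)
    """
    if len(before) != len(after):
        raise ValueError("both lists must cover the same preference pairs")
    if not before:
        raise ValueError("at least one preference pair is required")

    n = len(before)
    # Pairs the model got right before the later stage but gets wrong after it.
    lost = [i for i in range(n) if before[i] and not after[i]]
    # Pairs the model got wrong before the later stage but gets right after it.
    gained = [i for i in range(n) if not before[i] and after[i]]

    return {
        "accuracy_before": sum(before) / n,
        "accuracy_after": sum(after) / n,
        "lost": lost,
        "gained": gained,
        "churn": (len(lost) + len(gained)) / n,
    }


# Toy data (invented): one pair is lost and another is gained.
before = [True, True, True, False, False, True]
after = [True, False, True, True, False, True]

report = pairwise_report(before, after)
print(report)
```

In this toy case the aggregate accuracy is identical before and after, yet a third of the pairs changed outcome. A per-pair comparison of this kind is one way such redistribution could be surfaced.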

The researchers found no strong evidence that training stages fight each other through direct gradient opposition - the usual theoretical suspect. That leaves the mechanism unexplained, which means practitioners are still flying somewhat blind when stacking training objectives.
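For readers unfamiliar with the idea, "direct gradient opposition" is commonly checked by taking the cosine similarity between the gradients of two objectives: a value near -1 means the updates pull in opposite directions, while a value near 0 means they are largely unrelated. The sketch below illustrates that general check only. How the researchers measured it is not described here, and the gradient vectors are invented numbers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical flattened gradients of two training stages with respect to the
# same parameters. These are made-up values, not measurements.
grad_stage_1 = [0.5, -1.0, 0.25]
grad_stage_2 = [0.4, 0.1, -0.2]

alignment = cosine_similarity(grad_stage_1, grad_stage_2)
print(f"gradient cosine similarity: {alignment:.3f}")
```

A similarity close to zero, as in this invented example, would be the kind of reading that offers no strong sign of direct opposition between stages.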
