Poisoned fine‑tuning lets attackers stealthily hijack LLM behavior

A new study reveals that LLMs fine‑tuned on uncurated data can be silently hijacked through a semantic backdoor the authors call a covert control attack.

The researchers train a model on a tiny poisoned slice of the training set, teaching it to associate attacker‑chosen phrases with hidden instructions via shared facts. The scheme works across five popular LLMs and survives three backdoor defenses and four prompt‑injection defenses. With only a small fraction of poisoned data, attack success climbs about 40 % higher than standard prompt‑injection tricks, reaching up to 93 % success after backdoor defenses and 98 % after prompt‑injection defenses.

This matters because most deployment pipelines trust fine‑tuned models without rigorous data vetting. Existing defenses focus on obvious trigger words or abnormal training patterns; the covert control method hides in ordinary semantic relationships, making detection far harder. Operators may need to audit training data more aggressively or develop defenses that look for hidden semantic channels.

In short, the paper proves that even minimal poison can give attackers reliable, stealthy command channels, a vulnerability that current defensive tools are not built to catch.