Fine-tuning large language models on security data makes them look better without making them think better.
Researchers built CWE-Trace, a framework of 834 hand-curated Linux kernel code samples covering 74 vulnerability classes, then used it to test eight base models and 15 fine-tuned variants. To rule out data leakage, they split samples by a strict pre- and post-2025 cutoff. The results were blunt: the best model hit only 52.1% accuracy on binary detection, about two percentage points above random chance. Exact vulnerability classification landed below 1.3% Top-1 accuracy. Fine-tuning shifted how confidently models answered; it did not change what they were actually deciding.
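To make the leakage control concrete, here is a minimal Python sketch of a date-based split followed by a binary accuracy score. It is an illustration under stated assumptions, not the researchers' code: the record layout (`date`, `vulnerable`), the function names, and the exact boundary of January 1, 2025 are all hypothetical, since the study is described only as using a strict pre- and post-2025 cutoff.

```python
from datetime import date

# Assumed boundary: the source says only "pre- and post-2025".
CUTOFF = date(2025, 1, 1)


def split_by_cutoff(samples, cutoff=CUTOFF):
    """Return (historical, post_cutoff) lists based on each sample's date."""
    historical = [s for s in samples if s["date"] < cutoff]
    post_cutoff = [s for s in samples if s["date"] >= cutoff]
    return historical, post_cutoff


def accuracy(samples, predict):
    """Fraction of samples whose predicted label matches the true label."""
    if not samples:
        raise ValueError("no samples to score")
    correct = sum(1 for s in samples if predict(s) == s["vulnerable"])
    return correct / len(samples)


# Hypothetical records; real samples would also carry code and a CWE label.
samples = [
    {"date": date(2024, 12, 31), "vulnerable": True},
    {"date": date(2025, 1, 1), "vulnerable": False},
    {"date": date(2025, 6, 1), "vulnerable": True},
]
historical, post_cutoff = split_by_cutoff(samples)
```

Scoring a model only on `post_cutoff` is what lets a result be read as generalization rather than recall. On a balanced test set, a predictor that always answers "vulnerable" already scores 50%, which is the baseline the 52.1% figure should be compared against.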
The paper coins a useful phrase for this failure mode: "calibration without comprehension." Models adapt their output distributions to match training data while the underlying decision logic stays fixed. The researchers tracked this pattern using two new diagnostic metrics: the Directional Failure Index, and Hierarchical Distance and Direction. Those metrics showed that each model's systematic failure tendencies persisted from historical to post-cutoff data, meaning fine-tuning did not correct them. Data contamination, often blamed when benchmarks look suspiciously good, turned out not to matter: 84% of nominally contaminated samples carried no usable memorization signal.
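The gap between confidence and decision can be shown with a toy example. The sketch below uses softmax temperature scaling as a stand-in: it is not the paper's method and does not implement either diagnostic metric, and the logit values are invented. It only shows that a model's reported confidence can move a long way while the answer it picks stays the same.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def decision_and_confidence(logits, temperature=1.0):
    """Return (index of the chosen class, probability assigned to it)."""
    probs = softmax(logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Hypothetical scores for the classes [vulnerable, safe].
logits = [0.4, 0.1]
before = decision_and_confidence(logits, temperature=1.0)
after = decision_and_confidence(logits, temperature=0.1)
```

Here `after` reports much higher confidence than `before`, yet both pick class 0. If that choice is wrong, the sharpened model is simply wrong with more conviction.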
Security vendors and AI labs have leaned hard on benchmark scores to sell vulnerability-scanning tools, but a ceiling of roughly 52% binary detection accuracy on systems software is an argument for skepticism, not procurement.