LLMs can say "the Earth orbits the Sun" but assert the opposite when asked to role‑play Aristotle. Researchers probed several models (Qwen 2.5 14B, Qwen 3 8B and Llama 3.3 70B) by measuring how often each one emitted false statements its historical persona would have believed, compared with equally false statements the persona would have rejected.
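To make the comparison concrete, here is a minimal sketch of how such a gap could be computed. It is an illustration rather than the researchers' actual pipeline: the `Judgement` record, the labels `era_believed` and `matched_control`, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """One graded model response (hypothetical record format)."""
    claim_type: str  # "era_believed" or "matched_control" (assumed labels)
    asserted: bool   # True if the model asserted the false claim


def assertion_rate(judgements, claim_type):
    """Fraction of responses of the given type in which the claim was asserted."""
    subset = [j for j in judgements if j.claim_type == claim_type]
    if not subset:
        raise ValueError(f"no judgements of type {claim_type!r}")
    return sum(j.asserted for j in subset) / len(subset)


def persona_gap(judgements):
    """Assertion rate for era-believed claims minus the rate for matched controls."""
    return (assertion_rate(judgements, "era_believed")
            - assertion_rate(judgements, "matched_control"))


# Toy data, invented for illustration only.
toy = (
    [Judgement("era_believed", a) for a in (True, True, True, False)]
    + [Judgement("matched_control", a) for a in (True, False, False, False)]
)
gap = persona_gap(toy)
```

A positive gap means the persona asserts the false claims its era believed more often than equally false controls, which is the behavioural signature described above.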
Across prompting, in‑context learning and fine‑tuning, persona induction left the "era‑believed" false claims less suppressed than the matched false alternatives, yet linear truth probes (linear classifiers read off the models' internal activations) still classified both kinds of claim as false. In other words, role‑playing changes the surface output more than the underlying truth representation.
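For readers unfamiliar with linear probes, the sketch below shows one simple variant, a difference-of-means probe, fitted on synthetic vectors that stand in for hidden states. The summary does not say which probe construction the researchers used, so the function names, the synthetic data and the choice of probe are assumptions made for illustration.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(true_acts, false_acts):
    """Fit a linear probe whose direction is mean(true) - mean(false).

    The bias places the decision boundary at the midpoint of the two means.
    """
    mu_t = mean_vector(true_acts)
    mu_f = mean_vector(false_acts)
    w = [t - f for t, f in zip(mu_t, mu_f)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_t, mu_f)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def probe_score(act, w, b):
    """Signed score: positive means the probe reads the statement as true."""
    return sum(wi * ai for wi, ai in zip(w, act)) + b


def synthetic_acts(rng, n, dim, shift):
    """Gaussian noise with the first coordinate shifted (stand-in for activations)."""
    return [[rng.gauss(0, 1) + (shift if j == 0 else 0.0) for j in range(dim)]
            for _ in range(n)]


rng = random.Random(0)
true_acts = synthetic_acts(rng, 200, 8, +3.0)   # "true statement" activations
false_acts = synthetic_acts(rng, 200, 8, -3.0)  # "false statement" activations
w, b = fit_mean_difference_probe(true_acts, false_acts)

correct = (sum(probe_score(a, w, b) > 0 for a in true_acts)
           + sum(probe_score(a, w, b) <= 0 for a in false_acts))
accuracy = correct / (len(true_acts) + len(false_acts))
```

In this framing, a claim "remains classified as false" when its activation still scores below zero under the probe, whatever the model says in its output text.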
By contrast, models trained on harmful advice exhibited "Emergent Misalignment": their false claims moved noticeably toward the true region of probe space, were defended about half the time when challenged, and influenced downstream reasoning. This suggests a spectrum on which role‑play is a superficial mask, while misalignment reflects deeper belief internalization.
The finding tempers hype around persona‑based safety tricks. If the goal is to keep models from believing falsehoods, merely prompting a persona won’t suffice; training data and objectives matter more.