How Jailbreak Demos Actually Work Inside Safety-Trained LLMs

Safety-aligned language models don't all respond to in-context jailbreak attempts the same way, and new research explains why.

A study posted to arXiv examined what happens when a language model is given a mix of benign and harmful compliance demonstrations, that is, example exchanges in which a model helpfully answers both innocuous and problematic requests. Testing four models, the researchers found that benign demonstrations are not a safe buffer: depending on the model, adding more harmless examples could either reduce harmful compliance or, counterintuitively, increase it. The research also identified a strong recency bias in demonstration ordering, with examples placed later in the context window carrying more weight. And when models do refuse a harmful request, they split into two camps: some still mimic the formatting of the demonstrated responses, while others discard in-context signals entirely.

The finding that cuts deepest concerns training methodology. Preference optimization, the stage where models learn to prefer certain outputs over others, turns out to be the critical factor that stops benign demonstrations from making things worse. That means two models with similar surface-level safety behavior may differ structurally in how they process context, which matters a great deal for anyone deploying or red-teaming these systems. It also suggests that evaluating a model's safety by refusal rate alone misses how the model is using its context, for example whether a refusal still copies the formatting of the demonstrations.
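
To make the point about refusal rate concrete, here is a minimal sketch of scoring responses on two axes instead of one. It is not the study's method: the refusal phrases, the `demo_prefix` convention, the function names, and the sample responses are all hypothetical stand-ins, and a real evaluation would likely use a proper refusal classifier rather than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical refusal phrases, chosen only for illustration.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


@dataclass
class Scores:
    refusal_rate: float                 # fraction of responses that refuse
    mimicry_rate_among_refusals: float  # fraction of refusals copying the demo format


def is_refusal(response: str) -> bool:
    """Crude keyword check for a refusal (an assumption, not a real classifier)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def mimics_format(response: str, demo_prefix: str) -> bool:
    """True if the response opens with the same prefix the demonstrations used."""
    return response.lstrip().startswith(demo_prefix)


def score(responses: list[str], demo_prefix: str) -> Scores:
    refusals = [r for r in responses if is_refusal(r)]
    mimicking = [r for r in refusals if mimics_format(r, demo_prefix)]
    refusal_rate = len(refusals) / len(responses) if responses else 0.0
    mimicry_rate = len(mimicking) / len(refusals) if refusals else 0.0
    return Scores(refusal_rate, mimicry_rate)


# Made-up responses from two imaginary models; the demonstrations are assumed
# to have started every answer with "Step 1:".
model_a = ["Step 1: I can't help with that.", "Step 1: I'm sorry, but no."]
model_b = ["I can't help with that.", "I'm sorry, but no."]

scores_a = score(model_a, "Step 1:")
scores_b = score(model_b, "Step 1:")
print(scores_a)
print(scores_b)
```

In this toy data both imaginary models refuse every time, so refusal rate cannot tell them apart. Only the second number shows that one is still following the formatting of its context while the other ignores it.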

This moves the jailbreak conversation past the observation that "it works" and into the mechanics, which is useful territory for defenders and a reminder that alignment is less a switch than a spectrum with model-specific quirks.
