Jailbreaking a language model may come down to a few features buried deep in its wiring.
Researchers ran Gemma-2-2B, a small open-weight model, through a sparse autoencoder to split its internal activations into discrete features, then hunted for the ones tied to unsafe output. They pulled single-category harmful examples from the BeaverTails dataset to cut cross-topic noise, matched harmful concepts in adversarial responses to the prompt tokens that evoked them, and grouped the resulting features three ways across all 26 layers. Amplifying the top features in each group and scoring the output with a standardized harmfulness judge, they found that the grouping driven by a single harmful token worked about as well as the broader cluster-based method. The vulnerable features appeared in early and late layers alike, but were most concentrated in the mid-to-late layers.
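To make "amplifying a feature" concrete, here is a minimal sketch of how feature amplification through a sparse autoencoder is commonly done. It is an illustration under stated assumptions, not the study's code: the weights are random rather than a trained Gemma-2-2B autoencoder, the dimensions are toy-sized, and the names `sae_encode` and `amplify_features` are hypothetical. It assumes a plain ReLU encoder and linear decoder, and it adds only the change caused by the scaled features back to the original activation.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: map a model activation to sparse feature activations."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def amplify_features(x, W_enc, b_enc, W_dec, feature_ids, scale):
    """Scale the chosen features and return the steered activation.

    Only the difference between scaled and original features is decoded
    and added to x, so everything else in the activation is left untouched.
    """
    f = sae_encode(x, W_enc, b_enc)
    f_steered = f.copy()
    f_steered[..., feature_ids] *= scale
    return x + (f_steered - f) @ W_dec

# Toy setup (assumption): random weights stand in for a trained autoencoder.
d_model, n_features = 8, 32
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
x = rng.normal(size=d_model)  # stand-in for one activation vector at one layer

f = sae_encode(x, W_enc, b_enc)
top_ids = np.argsort(f)[-2:]  # the two most active features for this input
scale = 4.0
steered = amplify_features(x, W_enc, b_enc, W_dec, top_ids, scale)
```

In a real experiment the steered activation would be written back into the model at the chosen layer during generation, and the resulting text would then be scored for harmfulness; neither step is shown here.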
Most mechanistic safety work so far has explained jailbreaks through broad objects: a global refusal direction, an activation-steering vector, a handful of refusal features. This study argues the weak points are narrower and more local than that. If vulnerability really lives in sparse, token-addressable subgroups, defenders get a more precise place to look, and so does anyone trying to pry the model open.
The caveat is size. This is one 2B-parameter model under controlled conditions, and "comparable harmfulness" in front of a lab judge is not the same as a working attack on a frontier system. Maps of these weak points may keep getting more detailed; whether the findings hold at larger scale is the open question.