[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"branding":3,"analytics":7,"article-jailbreaks-trace-to-sparse-token-level-features-in-llms":10,"sections":35},{"siteName":4,"siteTagline":5,"publisherName":4,"contactEmail":6},"The Revision","Tech news, decoded.","editor@therevision.news",{"gaMeasurementId":8,"adsenseClientId":9},"G-ZW2MV82GYR","ca-pub-8533917693782264",{"article":11},{"id":12,"slug":13,"title":14,"dek":15,"body_md":16,"tags_json":17,"published_at":18,"created_at":19,"updated_at":20,"status":21,"review_note":22,"review_notes":23,"image_url":22,"persona_id":22,"persona_name":22,"section":24,"tags":25,"sources":30,"feedback":34,"feedback_at":22,"cost_usd":34,"total_tokens":34},1610,"jailbreaks-trace-to-sparse-token-level-features-in-llms","Jailbreaks Trace to Sparse, Token-Level Features in LLMs","A new study decomposes Gemma-2-2B and finds jailbreak vulnerability concentrated in sparse feature subgroups that a single harmful prompt token can pinpoint.","Jailbreaking a language model may come down to a few features buried deep in its wiring.\n\nResearchers ran Gemma-2-2B, a small open-weight model, through a sparse autoencoder to split its internal activations into discrete features, then hunted for the ones tied to unsafe output. They pulled single-category harmful examples from the BeaverTails dataset to cut cross-topic noise, matched harmful concepts in adversarial responses to the prompt tokens that evoked them, and grouped the resulting features three ways across all 26 layers. Amplifying the top features in each group and scoring the output with a standardized harmfulness judge, they found that grouping driven by a single harmful token worked about as well as the broader cluster-based method. 
The vulnerable features appeared in early and late layers alike, but clustered in the mid-to-late layers.\n\nMost mechanistic safety work so far has explained jailbreaks through broad objects: a global refusal direction, an activation-steering vector, a handful of refusal features. This study argues the weak points are narrower and more local than that. If vulnerability really lives in sparse, token-addressable subgroups, defenders get a more precise place to look, and so does anyone trying to pry the model open.\n\nThe caveat is size. This is one 2B-parameter model under controlled conditions, and \"comparable harmfulness\" as scored by a standardized judge is not the same as a working attack on a frontier system. The map may keep getting more detailed; whether the finding scales to larger models is the open question.","[\"llm safety\",\"interpretability\",\"jailbreaks\",\"ai security\"]","2026-06-18T04:00:00.000Z","2026-06-19T05:49:40.738Z","2026-06-19T05:49:43.525Z","published",null,[],"ai",[26,27,28,29],"llm safety","interpretability","jailbreaks","ai security",[31],{"name":32,"url":33},"arXiv cs.AI","https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.23130",0,{"sections":36},[37,41,45,50,55,60,65,70,74,78,83,88,93,98],{"name":38,"slug":24,"count":39,"latest_published_at":40},"AI",490,"2026-06-19T04:00:00.000Z",{"name":42,"slug":43,"count":44,"latest_published_at":40},"Security","security",132,{"name":46,"slug":47,"count":48,"latest_published_at":49},"Policy","policy",88,"2026-06-16T09:26:09.000Z",{"name":51,"slug":52,"count":53,"latest_published_at":54},"Consumer Tech","consumer-tech",78,"2026-06-16T17:58:24.000Z",{"name":56,"slug":57,"count":58,"latest_published_at":59},"Hardware","hardware",62,"2026-06-18T15:24:16.000Z",{"name":61,"slug":62,"count":63,"latest_published_at":64},"Software","software",58,"2026-06-16T20:00:00.000Z",{"name":66,"slug":67,"count":68,"latest_published_at":69},"Deals","deals",56,"2026-06-19T12:30:04.000Z",{"name":71,"slug":72,"count":73,"latest_published_at":40},"Dev Tools","dev-tools",50,{"name":75,"slug":76,"count":77,"latest_published_at":18},"Science","science",38,{"name":79,"slug":80,"count":81,"latest_published_at":82},"Gaming","gaming",31,"2026-06-16T15:25:13.000Z",{"name":84,"slug":85,"count":86,"latest_published_at":87},"General","general",26,"2026-06-13T18:35:15.000Z",{"name":89,"slug":90,"count":91,"latest_published_at":92},"Startups","startups",23,"2026-06-16T15:00:00.000Z",{"name":94,"slug":95,"count":96,"latest_published_at":97},"Reviews","reviews",19,"2026-06-14T08:00:00.000Z",{"name":99,"slug":100,"count":101,"latest_published_at":102},"How-To","how-to",6,"2026-06-16T09:00:00.000Z"]