[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"branding":3,"analytics":7,"article-weight-norms-do-not-cause-grokking-logit-scale-does":10,"sections":34},{"siteName":4,"siteTagline":5,"publisherName":4,"contactEmail":6},"The Revision","Tech news, decoded.","editor@therevision.news",{"gaMeasurementId":8,"adsenseClientId":9},"G-ZW2MV82GYR","ca-pub-8533917693782264",{"article":11},{"id":12,"slug":13,"title":14,"dek":15,"body_md":16,"tags_json":17,"published_at":18,"created_at":19,"updated_at":20,"status":21,"review_note":22,"review_notes":23,"image_url":22,"persona_id":22,"persona_name":22,"section":24,"tags":25,"sources":29,"feedback":33,"feedback_at":22,"cost_usd":33,"total_tokens":33},1647,"weight-norms-do-not-cause-grokking-logit-scale-does","Weight Norms Do Not Cause Grokking - Logit Scale Does","New research finds the weight norm's role in grokking is indirect: it controls generalization only by setting the logit scale fed into softmax.","A new paper reframes what drives grokking, the phenomenon where neural nets suddenly generalize long after they appear to have only memorized training data.\n\nResearchers fixed weight norms using clamping, then varied output temperature independently. That let them slide the grokking delay across its full range without touching the norm itself. Matching the effective logit scale back to a baseline recovered about 85% of the delay across two moduli. When they mapped delay against a grid of norms and temperatures together, logit scale alone explained the variance with an R-squared of 0.97; the norm contributed only 1-2% on top. The effect is also loss-function-dependent: under mean-squared error, the logit scale stays fixed and the norm takes a different path, which tells you the cross-entropy result is not a universal law.\n\nThe distinction matters because most grokking research treats weight norm as the causal lever and regularization as the prescription. 
If the norm is only an upstream handle on logit scale and the softmax saturation it produces, then interventions aimed at the norm may be targeting the wrong variable - and researchers tuning weight decay to speed generalization may be one step removed from what actually works.\n\nThe team also ran a float64 softmax-collapse audit and tested a no-LayerNorm transformer to close off alternative explanations; a forking-arms experiment confirmed the delay follows the held norm value, not the clamping operation itself, ruling out a rescaling artifact. All results reproduce from released code and data - a detail worth noting when mechanistic interpretability papers often do not.","[\"machine learning\",\"research\",\"neural networks\",\"ai\"]","2026-06-18T04:00:00.000Z","2026-06-19T09:04:03.022Z","2026-06-19T09:04:04.530Z","published",null,[],"ai",[26,27,28,24],"machine learning","research","neural networks",[30],{"name":31,"url":32},"arXiv cs.AI","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.18465",0,{"sections":35},[36,40,44,49,54,59,64,69,73,77,82,87,92,97],{"name":37,"slug":24,"count":38,"latest_published_at":39},"AI",490,"2026-06-19T04:00:00.000Z",{"name":41,"slug":42,"count":43,"latest_published_at":39},"Security","security",132,{"name":45,"slug":46,"count":47,"latest_published_at":48},"Policy","policy",88,"2026-06-16T09:26:09.000Z",{"name":50,"slug":51,"count":52,"latest_published_at":53},"Consumer Tech","consumer-tech",78,"2026-06-16T17:58:24.000Z",{"name":55,"slug":56,"count":57,"latest_published_at":58},"Hardware","hardware",62,"2026-06-18T15:24:16.000Z",{"name":60,"slug":61,"count":62,"latest_published_at":63},"Software","software",58,"2026-06-16T20:00:00.000Z",{"name":65,"slug":66,"count":67,"latest_published_at":68},"Deals","deals",56,"2026-06-19T12:30:04.000Z",{"name":70,"slug":71,"count":72,"latest_published_at":39},"Dev 
Tools","dev-tools",50,{"name":74,"slug":75,"count":76,"latest_published_at":18},"Science","science",38,{"name":78,"slug":79,"count":80,"latest_published_at":81},"Gaming","gaming",31,"2026-06-16T15:25:13.000Z",{"name":83,"slug":84,"count":85,"latest_published_at":86},"General","general",26,"2026-06-13T18:35:15.000Z",{"name":88,"slug":89,"count":90,"latest_published_at":91},"Startups","startups",23,"2026-06-16T15:00:00.000Z",{"name":93,"slug":94,"count":95,"latest_published_at":96},"Reviews","reviews",19,"2026-06-14T08:00:00.000Z",{"name":98,"slug":99,"count":100,"latest_published_at":101},"How-To","how-to",6,"2026-06-16T09:00:00.000Z"]