[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"branding":3,"analytics":7,"article-llms-make-unreliable-a-b-test-subjects-research-finds":10,"sections":34},{"siteName":4,"siteTagline":5,"publisherName":4,"contactEmail":6},"The Revision","Tech news, decoded.","editor@therevision.news",{"gaMeasurementId":8,"adsenseClientId":9},"G-ZW2MV82GYR","ca-pub-8533917693782264",{"article":11},{"id":12,"slug":13,"title":14,"dek":15,"body_md":16,"tags_json":17,"published_at":18,"created_at":19,"updated_at":20,"status":21,"review_note":22,"review_notes":23,"image_url":22,"persona_id":22,"persona_name":22,"section":24,"tags":25,"sources":29,"feedback":33,"feedback_at":22,"cost_usd":33,"total_tokens":33},1815,"llms-make-unreliable-a-b-test-subjects-research-finds","LLMs Make Unreliable A-B Test Subjects, Research Finds","A new statistical framework shows that using LLMs as stand-ins for human participants in A\u002FB tests produces correct results only by assumption, not by design.","Replacing human participants in A\u002FB tests with large language models sounds efficient, but it is not necessarily accurate.\n\nResearchers developed a statistical framework adapting surrogate endpoint theory to LLM-based experimentation, examining when treatment effects measured on LLM outputs can reliably stand in for effects that would have been measured on real humans. The short answer: rarely without extra work. Raw LLM predictions recovered only 39% of the human treatment effect in an empirical test using Upworthy headline data. Nonparametric calibration, essentially tuning LLM outputs against a human baseline, substantially closed that gap, but calibration requires human data to begin with, which undercuts the appeal of skipping humans entirely.\n\nThe gap matters because A\u002FB testing on humans is correct by design; A\u002FB testing on LLMs is correct only if a set of assumptions holds. 
Those assumptions, surrogacy and comparability, are hardest to justify in precisely the scenarios where running LLM-only tests looks most attractive, such as low-budget or high-speed experiments where collecting human data is the bottleneck. The research also finds that LLM stochasticity introduces both bias and variance, though averaging across multiple draws per unit helps.\n\nOrganizations already using LLMs to simulate user behavior in experiments should treat the findings as a warning label: the method can work, but validating it demands the human pilot studies that many teams are trying to avoid.","[\"ai\",\"research\",\"a-b testing\",\"experimentation\"]","2026-06-19T04:00:00.000Z","2026-06-19T12:31:29.599Z","2026-06-19T14:22:19.964Z","published",null,[],"ai",[24,26,27,28],"research","a-b testing","experimentation",[30],{"name":31,"url":32},"arXiv cs.AI","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.17165",0,{"sections":35},[36,40,44,49,54,59,64,68,72,77,82,87,92,97],{"name":37,"slug":24,"count":38,"latest_published_at":39},"AI",491,"2026-06-19T14:59:11.000Z",{"name":41,"slug":42,"count":43,"latest_published_at":18},"Security","security",132,{"name":45,"slug":46,"count":47,"latest_published_at":48},"Policy","policy",88,"2026-06-16T09:26:09.000Z",{"name":50,"slug":51,"count":52,"latest_published_at":53},"Consumer Tech","consumer-tech",78,"2026-06-16T17:58:24.000Z",{"name":55,"slug":56,"count":57,"latest_published_at":58},"Hardware","hardware",62,"2026-06-18T15:24:16.000Z",{"name":60,"slug":61,"count":62,"latest_published_at":63},"Deals","deals",58,"2026-06-19T14:43:50.000Z",{"name":65,"slug":66,"count":62,"latest_published_at":67},"Software","software","2026-06-16T20:00:00.000Z",{"name":69,"slug":70,"count":71,"latest_published_at":18},"Dev 
Tools","dev-tools",50,{"name":73,"slug":74,"count":75,"latest_published_at":76},"Science","science",38,"2026-06-18T04:00:00.000Z",{"name":78,"slug":79,"count":80,"latest_published_at":81},"Gaming","gaming",31,"2026-06-16T15:25:13.000Z",{"name":83,"slug":84,"count":85,"latest_published_at":86},"General","general",26,"2026-06-13T18:35:15.000Z",{"name":88,"slug":89,"count":90,"latest_published_at":91},"Startups","startups",23,"2026-06-16T15:00:00.000Z",{"name":93,"slug":94,"count":95,"latest_published_at":96},"Reviews","reviews",19,"2026-06-14T08:00:00.000Z",{"name":98,"slug":99,"count":100,"latest_published_at":101},"How-To","how-to",6,"2026-06-16T09:00:00.000Z"]