LLMs Make Unreliable A/B Test Subjects, Research Finds

Replacing human participants in A/B tests with large language models sounds efficient, but it is not necessarily accurate.

Researchers developed a statistical framework adapting surrogate endpoint theory to LLM-based experimentation, examining when treatment effects measured on LLM outputs can reliably stand in for effects that would have been measured on real humans. The short answer: rarely without extra work. Raw LLM predictions recovered only 39% of the human treatment effect in an empirical test using Upworthy headline data. Nonparametric calibration — essentially, tuning LLM outputs against a human baseline — substantially closed that gap, but calibration requires human data to begin with, which undercuts the appeal of skipping humans entirely.
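As a rough illustration of the two quantities in play, the Python sketch below computes the share of a human treatment effect that an LLM-based estimate recovers, then fits a simple binned calibrator on paired LLM and human outcomes. Histogram binning is only one nonparametric option and is an assumption here, not necessarily the researchers' method; every function name and parameter is hypothetical.

```python
from statistics import mean


def treatment_effect(treated, control):
    """Difference in mean outcomes between the treatment and control arms."""
    return mean(treated) - mean(control)


def fraction_recovered(llm_effect, human_effect):
    """Share of the human treatment effect captured by the LLM-based estimate."""
    if human_effect == 0:
        raise ValueError("human effect is zero; the ratio is undefined")
    return llm_effect / human_effect


def fit_binned_calibrator(llm_scores, human_outcomes, n_bins=4):
    """Histogram-binning calibration (an assumed, illustrative method).

    Sorts the paired data by LLM score, splits it into roughly n_bins
    equal-sized groups, and maps any new LLM score to the mean human
    outcome of the group it falls into.
    """
    pairs = sorted(zip(llm_scores, human_outcomes))
    size = max(1, len(pairs) // n_bins)
    bins = []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        upper_edge = chunk[-1][0]  # largest LLM score in this group
        bins.append((upper_edge, mean(y for _, y in chunk)))

    def calibrate(score):
        for upper_edge, value in bins:
            if score <= upper_edge:
                return value
        return bins[-1][1]  # scores above every edge fall into the last group

    return calibrate
```

Fitting the calibrator requires paired human outcomes, which is exactly the dependence on human data described above.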

The gap matters because A/B testing on humans is correct by design, whereas A/B testing on LLMs is correct only if a set of assumptions holds. Those assumptions, surrogacy and comparability, are hardest to justify precisely in the scenarios where running LLM-only tests looks most attractive, such as low-budget or high-speed experiments where collecting human data is the bottleneck. The research also finds that LLM stochasticity introduces both bias and variance, though averaging across multiple draws per unit helps.
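The effect of averaging can be seen in a toy simulation. The sketch below stands in for a stochastic LLM with a hypothetical scorer that adds Gaussian noise to a fixed true score, then compares the spread of per-unit scores when one draw is used against the spread when several draws are averaged. The noise model, the function names, and the default values are all assumptions made for illustration, and the sketch covers only the variance side of the finding, not the bias.

```python
import random
from statistics import mean, pvariance


def noisy_llm_score(true_score, rng, noise_sd=0.2):
    """Hypothetical stand-in for one stochastic LLM call (Gaussian noise assumed)."""
    return true_score + rng.gauss(0.0, noise_sd)


def averaged_score(true_score, rng, n_draws, noise_sd=0.2):
    """Average several independent draws for the same experimental unit."""
    return mean(noisy_llm_score(true_score, rng, noise_sd) for _ in range(n_draws))


def score_variance(n_draws, n_units=2000, seed=0):
    """Variance of per-unit scores when each unit averages n_draws draws."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    scores = [averaged_score(0.5, rng, n_draws) for _ in range(n_units)]
    return pvariance(scores)


if __name__ == "__main__":
    for k in (1, 5, 10):
        print(k, score_variance(k))
```

Under this independent-noise assumption the variance shrinks roughly in proportion to the number of draws, at the cost of that many more LLM calls per unit.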

Organizations already using LLMs to simulate user behavior in experiments should treat the findings as a warning label: the method can work, but validating it demands the human pilot studies that many teams are trying to avoid.
