How to Actually Measure Whether AI Can Enable Bioweapons

AI biology risk evaluations look rigorous on paper, but the numbers may mean less than they appear to.

A paper posted to arXiv this week examines how AI agents are being tested for biological risk - that is, whether systems capable of running multi-step scientific tasks could help someone develop dangerous pathogens or weapons. The authors argue that the results of such evaluations are highly sensitive to choices that surround the tests themselves: how risk is defined, how tasks are designed, how scoring works, and how everything is documented afterward. Those choices, they write, are often left implicit or omitted from published results entirely. The paper draws on the authors' own hands-on evaluation experience and addresses three primary audiences - policymakers, funders, and biosecurity practitioners - with a secondary nod to researchers inside frontier AI labs and third-party evaluators.

This matters because evaluation results are already shaping policy. Governments and regulators are making decisions about AI deployment based on outputs from assessments whose methodological foundations are not publicly scrutinized. If the design choices baked into those assessments are undisclosed, a clean bill of health from a biocapability evaluation is much harder to interpret - or trust.

The paper's framing is careful to a fault: it is not claiming AI systems are dangerous right now, nor that existing evaluations are fraudulent. It is saying the infrastructure for producing credible evidence is immature, and that this infrastructure matters as much as the results it generates. That is a quieter alarm than headlines about AI and bioweapons tend to sound, but arguably a more useful one: the hard problem here is not detecting a weapon but knowing whether your detector works.
