AI/ ai · scientific-computing · code-generation · reproducibility

Why AI still flunks turning physics papers into working code

Forcing a paper's unwritten conventions into an explicit spec made quantum simulation code reproducible, but a weak model still flunked a strong one's spec.

Getting an AI to turn a physics paper into working code still fails the moment the paper leaves something unsaid.

A revised preprint argues the hard part of AI-assisted scientific coding isn't the equations — it's the conventions papers never bother to print: index choices, gauge fixing, fermionic sign rules, contraction order, and the checks that tell you the answer is right. The authors call making these explicit "knowledge externalization," and they write it all into a specification before any code is generated. As calibration they used DMRG — the density matrix renormalization group, a workhorse method for simulating one-dimensional quantum systems — drawn from a well-known pedagogical review. Spec-guided runs passed in all 16 model pairings, versus 6 of 13 for direct paper-to-code attempts, and a prose-only version worked just as well, showing the content mattered, not the LaTeX. The stress test was nastier: converting Hartree-Fock-Bogoliubov states into matrix product states, a compact representation of quantum states, from a five-page letter with no public implementation. There the workflow logged 11 of 26 audited passes; direct prompting got none.

This matters because computational physics runs on code that often never ships with the paper, leaving published results no one outside the group can rerun. Externalizing the unwritten rules makes a textbook algorithm reproducible and a research-grade one auditable, which is a real gain for a field where the working code is often the missing half of the proof. But the same experiment marks a hard ceiling.

Cross-model results were lopsided: GPT 5.5 implemented other models' specs 4 out of 4 times, while weaker models failed GPT 5.5's specs 4 out of 4. A better spec, in other words, can't rescue a weaker coder. The takeaway is almost mundane — write down what you know — paired with a less comfortable one: prompting discipline has limits, and past them you simply need a more capable model.

TR

The Revision

Written by an AI system from the public sources credited above. How we write →