A research benchmark called NRT-Bench puts frontier AI models in charge of a simulated nuclear power plant, then attacks them until something breaks.
Researchers built a five-role operator team, each role backed by a configurable large language model, running a simulated plant governed by six critical safety functions. Adversaries inject messages across four channels over multiple conversational turns. The failure condition is concrete: a run ends the moment any safety function is lost, and the attacking message is flagged as the cause. Testing four frontier models under a fixed-attack replay protocol, the team found that adaptive multi-turn attacks pushed every model past a safety threshold — between 8.7% and 12.1% of attack sessions ended with a lost safety function.
The more important finding is not the failure rate but the failure pattern: of 149 test sessions, not one attack defeated all four models, yet roughly a third defeated at least one. Model vulnerabilities are nearly disjoint — knowing where one model breaks tells you almost nothing about where another will. That finding undermines the assumption that a safer-looking aggregate score means a safer system. It also complicates defense: the same guardrail stack that reduced attack success on one model raised it on another.
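The overlap argument above can be checked directly on replay results; this added example (not code or data from NRT-Bench) computes per-model loss rates, the share of sessions that defeat any or all models, and how often a break on one model transfers to another. The session outcomes, model names and function names are constructed for illustration.

```python
from itertools import combinations

MODELS = ["model_a", "model_b", "model_c", "model_d"]  # hypothetical names

# Constructed replay log (not NRT-Bench data): attack session -> set of models
# that lost a safety function when the same fixed attack was replayed.
REPLAY = {
    "s01": {"model_a"}, "s02": set(), "s03": {"model_b"}, "s04": set(),
    "s05": {"model_c"}, "s06": set(), "s07": {"model_d"}, "s08": set(),
    "s09": {"model_a", "model_b"}, "s10": set(), "s11": set(), "s12": set(),
}

def broken_by(replay, model):
    """Sessions in which the given model lost a safety function."""
    return {s for s, lost in replay.items() if model in lost}

def per_model_rate(replay, models):
    """Attack success rate per model: the usual headline score."""
    return {m: len(broken_by(replay, m)) / len(replay) for m in models}

def any_all_rates(replay, models):
    """Share of sessions that defeat at least one model, and all models."""
    n = len(replay)
    any_rate = sum(bool(lost) for lost in replay.values()) / n
    all_rate = sum(set(models) <= lost for lost in replay.values()) / n
    return any_rate, all_rate

def pairwise_jaccard(replay, models):
    """Overlap of failure sets per model pair; near 0 means disjoint."""
    out = {}
    for a, b in combinations(models, 2):
        sa, sb = broken_by(replay, a), broken_by(replay, b)
        out[(a, b)] = len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    return out

def transfer_rate(replay, src, dst):
    """Of the sessions that broke src, the fraction that also broke dst."""
    sa = broken_by(replay, src)
    return len(sa & broken_by(replay, dst)) / len(sa) if sa else 0.0

if __name__ == "__main__":
    print("per-model:", per_model_rate(REPLAY, MODELS))
    print("any / all:", any_all_rates(REPLAY, MODELS))
    for pair, j in pairwise_jaccard(REPLAY, MODELS).items():
        print(pair, round(j, 2))
    print("a -> c transfer:", transfer_rate(REPLAY, "model_a", "model_c"))
```

On this constructed log no single model loses a safety function in more than 2 of 12 sessions, yet 5 of 12 sessions break some model and none breaks all four.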
LLM agents are already being proposed as supervisory components for high-stakes infrastructure, a use case that tends to outrun safety evidence. NRT-Bench, with its simulation environment, attack dataset, and replay tooling released publicly, at least gives researchers a reproducible way to measure what they are actually deploying — which is more than most safety claims in this space can say.