A research tool called AdversaBench broke every large language model it was aimed at, and its attacks carried over to models they were never generated against.
AdversaBench is an end-to-end red-teaming pipeline that takes seed prompts, mutates them with five structured operators, and then confirms genuine failures using a three-judge panel with a meta-judge tiebreaker. Across 45 seed prompts spanning reasoning, instruction-following, and tool use, every single seed produced a confirmed failure. The pipeline is open-source, with code and datasets published on GitHub.
The transfer result is the finding worth watching. Adversarial prompts generated against Llama 3.1 8B transferred to Llama 3.3 70B with zero additional tuning, despite a gap of roughly 60 billion parameters. That suggests the mutations exploit general behavioral patterns baked into the training process rather than quirks of a specific model. If that holds at scale, red-teaming one model in a family may be enough to surface weaknesses across the rest.
The paper also flags a measurement problem that should make anyone nervous about published safety benchmarks. Pairwise judge agreement ran 80 to 87%, but Cohen's kappa was near zero because of label skew: when nearly every item gets the same label, judges agree that often by chance alone, so raw agreement numbers flatter the reliability of automated evaluation. The hardest category, instruction-following, took an average of 2.4 attacker iterations to crack, versus 1.1 for reasoning and tool use, a gap that flat binary pass-fail rates would never surface. Safety claims built on those rates deserve a second look.
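The gap between raw agreement and Cohen's kappa is easy to reproduce on a toy example. The sketch below is illustrative only: the 100 verdicts are made-up data chosen to be heavily skewed toward one label, not figures from the paper, and the function names are hypothetical rather than part of AdversaBench.

```python
def raw_agreement(a, b):
    """Fraction of items on which two judges give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(a)
    p_o = raw_agreement(a, b)  # observed agreement
    labels = set(a) | set(b)
    # Chance agreement: probability both judges pick the same label
    # if each labels independently at their own marginal rates.
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_e == 1.0:
        # Kappa is undefined here; treat it as no agreement beyond chance.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical, heavily skewed verdicts: each judge says "fail" 90% of the time.
# 81 items: both "fail"; 9: A fail / B pass; 9: A pass / B fail; 1: both "pass".
judge_a = ["fail"] * 81 + ["fail"] * 9 + ["pass"] * 9 + ["pass"] * 1
judge_b = ["fail"] * 81 + ["pass"] * 9 + ["fail"] * 9 + ["pass"] * 1

agreement = raw_agreement(judge_a, judge_b)
kappa = cohens_kappa(judge_a, judge_b)
print(f"raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```

In this toy case the judges agree on 82 of 100 items, yet two judges who each said "fail" 90% of the time at random would agree about as often (0.9 × 0.9 + 0.1 × 0.1 = 0.82), so kappa comes out at essentially zero.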
