OpenAI, Anthropic publish joint AI safety test results

OpenAI and Anthropic have released a joint safety evaluation in which each lab ran the other's flagship models through a battery of tests for misalignment, instruction following, hallucinations, and jailbreak resistance.

The report details how each model performed on identical prompts, noting where one succeeded and the other fell short. Both labs point to incremental improvements over their previous internal tests, but also flag persistent failure modes that survived the cross‑lab scrutiny. The findings are presented in a single document, with data tables and qualitative analysis from both teams.

The collaboration matters because it provides a rare, head‑to‑head benchmark that is difficult to fabricate. Independent cross‑testing forces labs to expose blind spots that internal audits often miss, offering the broader community a clearer view of current alignment limits. It also signals that competitors are willing to share metrics rather than hoard safety claims, a trend that could accelerate collective progress.

In short, the joint evaluation shows modest gains in robustness while underscoring that major alignment challenges remain, and it sets a precedent for more open safety benchmarking across the AI industry.
