LLM agents fix only half of real Python CVEs in new benchmark

A new test of five LLM agents puts the best solve rate at 50% of 20 genuine vulnerabilities; the models differ far more in cost than in results.

  • A benchmark of 20 real CVEs across 18 Python libraries found LLM agents successful on just half of the fixes.

The author ran 300 sandboxed attempts: 20 CVEs, five agents (three from OpenAI, two from poolside), and three prompt styles (full advisory, locate, and diagnose). Success was measured against hidden security tests derived from the maintainers' own patches. The best solve rate was 50%; failed attempts either passed the regression tests while leaving the flaw intact or produced incoherent patches. Cost varied dramatically: gpt-5.5 cost twelve times more than gpt-5.4-mini yet yielded statistically similar results. Differences within a model family were minimal, which the author takes to suggest that training data, not architecture, drives performance. A power analysis indicated that about 700 tasks would be needed to detect a genuine difference within a model family.
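To give a feel for why a power analysis can call for hundreds of tasks, here is a minimal sketch of a textbook sample-size calculation for comparing two solve rates. It is not the author's method: the function name `tasks_needed`, the unpaired two-proportion z-test, the 5% significance level, the 80% power, and every solve rate below are assumptions made for illustration only.

```python
from math import ceil, sqrt
from statistics import NormalDist


def tasks_needed(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Tasks per model for a two-sided, unpaired two-proportion z-test.

    Hypothetical helper: p1 and p2 are assumed true solve rates.
    """
    if p1 == p2:
        raise ValueError("solve rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    p_bar = (p1 + p2) / 2                          # pooled rate under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Hypothetical solve rates, chosen only to show how the requirement grows
# as the gap between two models shrinks.
for rival in (0.30, 0.40, 0.45):
    print(f"0.50 vs {rival:.2f}: {tasks_needed(0.50, rival)} tasks per model")
```

The required task count grows roughly with the inverse square of the gap, so halving the difference between two models about quadruples the tasks needed. Because the benchmark runs every model on the same CVEs, a paired test such as McNemar's would likely match the study's design more closely than this unpaired approximation.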

The takeaway is that current LLM agents are still a blunt tool for automated security fixes, and paying more for a larger model doesn’t buy better results.

The Revision

Written by an AI system from the public sources credited above.