LLM agents fix only half of real Python CVEs in new benchmark

A benchmark of 20 real CVEs across 18 Python libraries found that LLM agents fixed at most half of them.

The author ran 300 sandboxed attempts using five agents (three from OpenAI, two from poolside) and three prompt styles: full advisory, locate, and diagnose. Success was measured against hidden security tests derived from the maintainers’ own patches. The best solve rate was 50%; failed attempts either passed regression tests while leaving the flaw intact or produced incoherent patches. Cost varied dramatically: gpt-5.5 cost twelve times more than gpt-5.4-mini yet yielded statistically similar results. Differences within a model family were minimal, suggesting that training data, not architecture, drives performance. A power analysis indicated that about 700 tasks would be needed to detect a genuine difference within a model family.
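To give a feel for why a 20-task benchmark struggles to separate similar agents, here is a minimal sketch of a textbook sample-size calculation for comparing two solve rates with a two-sided z-test. It is not the author's power analysis: the function name `tasks_per_group`, the 5% significance level, the 80% power target, and the example solve rates are all assumptions chosen for illustration, and the sketch treats the two agents as independent samples rather than as paired runs on the same tasks.

```python
from math import ceil, sqrt
from statistics import NormalDist


def tasks_per_group(p1: float, p2: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Tasks each agent must attempt to detect solve rate p1 vs p2.

    Standard two-proportion z-test approximation (two-sided).
    All defaults are illustrative assumptions, not values from the benchmark.
    """
    if p1 == p2:
        raise ValueError("solve rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the power target
    p_bar = (p1 + p2) / 2                          # pooled rate under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical gaps below a 50% solve rate; not measured results.
    for gap in (0.20, 0.10, 0.05):
        n = tasks_per_group(0.50, 0.50 - gap)
        print(f"gap of {gap:.0%}: about {n} tasks per agent")
```

The required task count grows roughly with the inverse square of the gap, so halving the difference between two agents about quadruples the number of tasks needed to detect it.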

The takeaway is that current LLM agents are still a blunt tool for automated security fixes, and paying more for a larger model doesn’t buy better results.
