DualGauge adds a joint correctness‑security benchmark for code generated from plain specs.
The new framework runs every LLM‑generated solution through functional tests and security tests drawn from the same specification. It covers 307 tasks in Python, C++ and JavaScript and evaluates ten LLMs plus three coding agents. The best model still passes both test suites on fewer than 15% of attempts, and none of the usual levers (larger model size, chain‑of‑thought prompting, quantization, or instruction tuning) yields consistent gains.
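To make the joint metric concrete, here is a minimal sketch of how a pass-both rate could be computed from per-attempt results. The `Attempt` record, the `pass_rates` helper and the sample data are hypothetical illustrations, not DualGauge's actual schema, scoring code or results.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """Outcome of one generated solution (hypothetical record, not DualGauge's schema)."""
    functional_pass: bool  # True if every functional test passed
    security_pass: bool    # True if every security test passed


def pass_rates(attempts):
    """Return (functional, security, joint) pass rates as fractions of all attempts."""
    n = len(attempts)
    if n == 0:
        raise ValueError("no attempts to score")
    functional = sum(a.functional_pass for a in attempts) / n
    security = sum(a.security_pass for a in attempts) / n
    # An attempt counts toward the joint rate only if it passes BOTH suites.
    joint = sum(a.functional_pass and a.security_pass for a in attempts) / n
    return functional, security, joint


# Made-up sample data, for illustration only.
attempts = [
    Attempt(functional_pass=True, security_pass=False),
    Attempt(functional_pass=True, security_pass=True),
    Attempt(functional_pass=False, security_pass=True),
    Attempt(functional_pass=True, security_pass=False),
]
functional, security, joint = pass_rates(attempts)
print(functional, security, joint)  # 0.75 0.5 0.25
```

In this toy data the functional rate is 75% while the joint rate is only 25%. The joint rate can never exceed either individual rate, which is why a strong functional score alone says little about it.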
Why it matters: developers can no longer assume that higher functional scores imply safe code. The gap appears at contract boundaries and in weak input guards, a pattern visible only when functional and security metrics are combined. The finding also tempers the hype around iterative agents: Codex, OpenHands and Claude Code performed no better than straightforward LLM generation on these specification‑only prompts.
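A weak input guard can look like the sketch below: two path resolvers that both satisfy the same functional test, only one of which survives a security test written against the same requirement. The task, the `BASE` directory and all function names are invented for illustration and are not taken from the DualGauge task set.

```python
import posixpath

BASE = "/srv/uploads"  # hypothetical base directory for this example


def resolve_weak(name):
    """Correct for ordinary file names, but has no guard against '..' or absolute paths."""
    return posixpath.normpath(posixpath.join(BASE, name))


def resolve_guarded(name):
    """Same behavior for ordinary names, but rejects paths that leave BASE."""
    path = posixpath.normpath(posixpath.join(BASE, name))
    if posixpath.commonpath([BASE, path]) != BASE:
        raise ValueError("path escapes base directory")
    return path


def functional_test(resolve):
    """Functional requirement: a plain file name maps into the base directory."""
    return resolve("report.txt") == "/srv/uploads/report.txt"


def security_test(resolve):
    """Security requirement: a traversal attempt must be rejected or stay inside BASE."""
    try:
        result = resolve("../../etc/passwd")
    except ValueError:
        return True
    return result.startswith(BASE + "/")


for resolver in (resolve_weak, resolve_guarded):
    print(resolver.__name__, functional_test(resolver), security_test(resolver))
```

Both resolvers pass the functional test, so a correctness-only score cannot tell them apart. Only the security test exposes the missing guard in `resolve_weak`.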
The takeaway for practitioners is clear: before deploying AI‑written code, run a dedicated security audit alongside functional testing, checked against the same specification. Researchers should expand benchmarks to cover more realistic integration scenarios, where secure‑by‑design prompts and verification tools might close the gap left by the more than 85% of attempts that fail at least one of the two test suites.