[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"branding":3,"analytics":7,"article-dualgauge-shows-llms-still-miss-secure-code-generation":10},{"siteName":4,"siteTagline":5,"publisherName":4,"contactEmail":6},"The Revision","Tech news, decoded.","editor@therevision.news",{"gaMeasurementId":8,"adsenseClientId":9},"G-ZW2MV82GYR","ca-pub-8533917693782264",{"article":11},{"id":12,"slug":13,"title":14,"dek":15,"body_md":16,"tags_json":17,"published_at":18,"created_at":19,"updated_at":20,"status":21,"review_note":22,"review_notes":23,"image_url":22,"persona_id":22,"persona_name":22,"section":22,"tags":30,"sources":34,"feedback":38,"feedback_at":22,"cost_usd":38,"total_tokens":38},1333,"dualgauge-shows-llms-still-miss-secure-code-generation","DualGauge shows LLMs still miss secure code generation","Automated benchmarks reveal that even top models achieve under 15% joint correctness and security on specification-only tasks.","DualGauge adds a joint correctness‑security benchmark for code generated from plain specs.\n\nThe new framework runs every LLM‑generated solution through functional tests and security checks drawn from the same specification. It covers 307 tasks in Python, C++ and JavaScript and evaluates ten LLMs plus three coding agents. The best model still passes both test suites on fewer than 15% of attempts, and none of the usual levers (larger model size, chain‑of‑thought prompting, quantization, or instruction tuning) shows consistent gains.\n\nWhy it matters: developers cannot assume that higher functional scores imply safe code. The gap appears at contract boundaries and in weak input guards, a pattern visible only when functional and security metrics are combined. 
The finding also tempers the hype around iterative agents: Codex, OpenHands and Claude Code performed no better than straightforward LLM generation on these specification‑only prompts.\n\nThe takeaway for practitioners is clear: before deploying AI‑written code, run a dedicated security audit that mirrors functional testing. Researchers should expand benchmarks to cover more realistic integration scenarios, where secure‑by‑design prompts and verification tools might narrow the gap left by the more than 85% of attempts that currently fail one or both suites.","[\"llm\",\"code-generation\",\"security\"]","2026-06-16T04:00:00.000Z","2026-06-17T04:33:35.628Z","2026-06-17T04:33:38.457Z","published",null,[24],{"id":25,"reviewer":26,"round":27,"reason":28,"status":29},"editor-r1","editor",1,"Add a clear concluding paragraph that summarises the implications and next steps for developers or researchers.","resolved",[31,32,33],"llm","code-generation","security",[35],{"name":36,"url":37},"arXiv cs.AI","https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20709",0]