New Benchmark Shows Coding Agents Fail Within 5 to 6 Turns Without Test Feedback

AI coding agents fall apart fast when given real-world workloads, according to a new benchmark from Amazon Research.

StaminaBench puts coding agents through up to 100 consecutive change requests, simulating the kind of extended sessions that "vibe coding" actually involves. Agents implement a REST API server, then modify it across procedurally generated follow-up tasks — resulting in codebases that can reach 6,000 lines. Tests are generated without LLM involvement to keep results reproducible. Six agent harnesses paired with seven open-source models were evaluated across 20 scenarios.

The headline finding: every tested model failed within 5 to 6 turns when running without test feedback. That matters because most existing benchmarks measure the fraction of isolated tasks solved, a metric that hides how quickly agents degrade under sustained, iterative pressure. Feeding test results back to the agent and allowing retries increased the number of turns agents completed by up to 12x. The choice of harness proved nearly as important as the model itself: stronger models showed up to a 6x performance gap depending on harness.
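To make the difference between the two metrics concrete, here is a minimal Python sketch. It is an illustration, not StaminaBench's actual code: the function names, the `attempt` callback, and the assumption that a session ends at the first turn whose tests still fail after all retries are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def fraction_solved(outcomes: Sequence[bool]) -> float:
    """Share of tasks that passed, the usual isolated-task metric."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def turns_survived(outcomes: Sequence[bool]) -> int:
    """Number of consecutive passing turns before the first failure."""
    count = 0
    for passed in outcomes:
        if not passed:
            break
        count += 1
    return count


# Hypothetical callback: attempt(turn, feedback) -> (passed, test_report)
Attempt = Callable[[int, Optional[str]], Tuple[bool, Optional[str]]]


def run_session(attempt: Attempt, num_turns: int, max_retries: int = 0) -> int:
    """Run turns in order; return how many completed before an unrecovered failure.

    Assumption: the session stops at the first turn that still fails
    after the initial attempt plus `max_retries` retries.
    """
    for turn in range(num_turns):
        feedback = None  # no test report is available on the first attempt
        for _ in range(max_retries + 1):
            passed, feedback = attempt(turn, feedback)
            if passed:
                break
        else:
            return turn  # every attempt at this turn failed
    return num_turns
```

With made-up outcomes of five passes, one failure, then four passes, `fraction_solved` reports 0.9 while `turns_survived` reports 5. In `run_session`, setting `max_retries` above zero hands the previous test report back to the agent, which is the kind of feedback loop the benchmark found so effective.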

The broader implication is that benchmark scores on one-shot coding tasks are a poor proxy for the multi-turn sessions developers actually run — and that the industry's current enthusiasm for autonomous coding tools may be running ahead of what these models can reliably sustain.
