New CODA-BENCH shows code agents stall on data‑heavy tasks

CODA-BENCH reveals that current code agents solve just 61.1% of data‑intensive tasks.

The arXiv paper introduces a benchmark that couples code generation with large‑scale file navigation. Built on a Kaggle‑style Linux sandbox, each of the 1,009 tasks presents roughly 1,000 files across 31 communities. Agents must first locate the relevant datasets among those files, then write code to analyze them. Tests of the latest agents show that they frequently miss the data‑discovery step, which limits overall success to 61.1%.
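To make the data‑discovery step concrete, here is a minimal Python sketch of one naive way an agent might shortlist files before writing any analysis code. It is an illustration only, not the benchmark's harness or any agent's actual method: the function names `header_tokens` and `rank_candidate_files`, the keyword‑overlap scoring, and the assumption that datasets are CSV files with a header row are all hypothetical.

```python
import csv
import os
from pathlib import Path


def header_tokens(path: Path) -> set[str]:
    """Return lowercase column names from the first row of a CSV, else an empty set."""
    if path.suffix.lower() != ".csv":
        return set()
    try:
        with path.open(newline="", encoding="utf-8", errors="replace") as fh:
            first_row = next(csv.reader(fh), [])
    except OSError:
        return set()
    return {cell.strip().lower() for cell in first_row if cell.strip()}


def rank_candidate_files(root: str, query_terms: list[str], top_k: int = 5) -> list[tuple[int, str]]:
    """Walk `root` and score each file by keyword overlap (hypothetical heuristic).

    A file earns one point per query term found in its relative path and
    one point per query term that exactly matches a CSV column name.
    """
    terms = {t.lower() for t in query_terms}
    scored = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            rel = path.relative_to(root).as_posix()
            path_hits = sum(1 for t in terms if t in rel.lower())
            header_hits = len(terms & header_tokens(path))
            score = path_hits + header_hits
            if score > 0:
                scored.append((score, rel))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]
```

Calling `rank_candidate_files("/data", ["housing", "price"])` would return up to five `(score, relative_path)` pairs, where `/data` is a placeholder path. A heuristic this simple would likely struggle with roughly 1,000 files per task, which is the kind of difficulty the benchmark is designed to expose.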

The result matters because real‑world engineering rarely separates code from data. A benchmark that forces agents to juggle both exposes a blind spot in today’s models, which excel at pure coding contests but falter when data handling is required. This gap undercuts claims of end‑to‑end automation until agents learn to treat the file system as a first‑class resource.

In short, a 61.1% pass rate signals that we are still far from truly autonomous AI engineers; future work must close the data‑code loop.
