A Benchmark That Tries to See Into Coding's Future

Researchers want AI coding agents tested on problems they have not already seen — so a new method skips the past entirely and synthesizes tasks from forecasts of future code.

The paper, posted to arXiv, introduces SWE-Future, a data synthesis approach that avoids replaying public GitHub pull requests — the standard feed for most coding-agent benchmarks. Instead, it takes a snapshot of a repository at a fixed point in time, forecasts what kinds of tasks (bug fixes, feature additions, refactors) that repo is likely to need next, and generates synthetic coding tasks conditioned on those forecasts. The team ran a retrospective check on 80 repositories: forecasts made before a cutoff date matched actual future pull requests at a rate of 58.1 percent under the paper's main semantic matching metric. They then used those validated forecast families to build a 200-task dataset across 61 repositories.
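To make the retrospective check concrete, here is a minimal sketch of how a forecast match rate could be computed. It is an illustration under stated assumptions, not the paper's implementation: the `Forecast` class, the `match_rate` function, the 0.3 threshold, and the sample data are all hypothetical, and a simple token-overlap (Jaccard) score stands in for the paper's semantic matching metric, which the summary above does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Forecast:
    """A hypothetical forecast record; field names are illustrative."""
    repo: str
    kind: str          # e.g. "bug fix", "feature", "refactor"
    description: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, a crude stand-in for a semantic matcher."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_rate(forecasts, future_prs, threshold=0.3):
    """Fraction of forecasts matched by at least one post-cutoff PR
    in the same repository. The threshold is an arbitrary assumption."""
    if not forecasts:
        return 0.0
    hits = 0
    for f in forecasts:
        titles = future_prs.get(f.repo, [])
        if any(jaccard(f.description, t) >= threshold for t in titles):
            hits += 1
    return hits / len(forecasts)


# Made-up example data: forecasts written before a cutoff,
# and PR titles that appeared after it.
forecasts = [
    Forecast("demo/repo", "bug fix", "fix crash when config file is missing"),
    Forecast("demo/repo", "feature", "add yaml export option"),
]
future_prs = {"demo/repo": ["Fix crash when config file is missing on startup"]}

rate = match_rate(forecasts, future_prs)
print(f"match rate: {rate:.1%}")
```

In this toy run one of the two forecasts lines up with a later pull request, so the rate is one half. A real evaluation would need a far more robust matcher, since word overlap misses paraphrases and rewards superficial similarity.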

This matters because the standard approach — pulling real GitHub issues and replaying them as benchmark tasks — is increasingly suspect. Models are trained on public code and issues, which means a benchmark that recycles that history may be testing recognition rather than reasoning. SWE-Future's method does not fully solve that problem, but it introduces a principled wedge between what the model was trained on and what it is tested on.

A 58.1 percent match rate is solid, but it also means roughly four in ten forecasts did not line up with a pull request the repository actually received, at least under that metric. That leaves plenty of room for the method to generate tasks that diverge from what repositories needed, and it raises the question of whether a benchmark that avoids contamination but drifts from reality is trading one flaw for another.
