Researchers have built a forecasting benchmark for AI systems out of a strategy game, and it might solve one of the field's most stubborn problems: waiting on the real world to find out whether a forecast was right.
ForecastBench-Sim uses Freeciv, an open-source turn-based strategy game modeled on the Civilization series, to generate forecasting questions from live game states. A model receives a structured snapshot of the current game world and answers questions about what will happen next; the simulation then runs forward to score those predictions. Because the world is simulated, questions can target any time horizon, cover rare or catastrophic events, and support counterfactual setups, such as "what would have happened if this civilization had chosen a different policy?" The benchmark includes both binary and continuous question types, and the full pipeline, question families, and scoring protocol are being released publicly.
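The snapshot, forecast, roll-forward, and score loop can be sketched in a few lines. What follows is a toy illustration only: `ToyWorld`, its `gold` field, and the `evaluate` helper are hypothetical stand-ins rather than the Freeciv or ForecastBench-Sim API, and the scoring rules shown (Brier score for binary questions, absolute error for continuous ones) are common choices assumed here, not the benchmark's documented protocol.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyWorld:
    """Hypothetical stand-in for a game state; not the Freeciv API."""
    turn: int
    gold: int
    seed: int

    def snapshot(self) -> dict:
        # The structured view a forecasting model would be shown.
        return {"turn": self.turn, "gold": self.gold}

    def run_forward(self, turns: int) -> "ToyWorld":
        # Deterministic given the seed, so a question always resolves the same way.
        rng = random.Random(self.seed)
        gold = self.gold
        for _ in range(turns):
            gold += rng.randint(-5, 10)
        return ToyWorld(self.turn + turns, gold, self.seed)


def brier_score(prob: float, outcome: bool) -> float:
    # Squared error between a forecast probability and the 0/1 outcome.
    return (prob - float(outcome)) ** 2


def absolute_error(predicted: float, actual: float) -> float:
    return abs(predicted - actual)


def evaluate(world: ToyWorld, horizon: int, threshold: int,
             prob_above: float, predicted_gold: float) -> dict:
    # Resolve both question types by simulating `horizon` turns ahead.
    future = world.run_forward(horizon)
    return {
        "binary_brier": brier_score(prob_above, future.gold > threshold),
        "continuous_abs_error": absolute_error(predicted_gold, future.gold),
    }


world = ToyWorld(turn=0, gold=50, seed=42)
scores = evaluate(world, horizon=10, threshold=70,
                  prob_above=0.6, predicted_gold=75.0)
```

Lower is better for both scores. In this sketch the horizon is just an argument, which mirrors the point that a simulated world lets a question resolve at whatever horizon is wanted.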
Existing forecasting benchmarks inherit real-world constraints: outcomes take months or years to resolve, tail events almost never appear in training data, and it is nearly impossible to run controlled experiments with alternate histories. ForecastBench-Sim sidesteps all three problems by treating the game engine as a kind of on-demand reality that can be paused, forked, and re-run. The researchers also ran a human pilot alongside the model evaluations, which provides at least a rough baseline for comparing AI systems with people.
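To make "paused, forked, and re-run" concrete, here is a minimal sketch of a counterfactual fork. It assumes a toy simulator (`ToySim`, with a hypothetical `tax_rate` policy knob) in place of the real game engine, and it uses Python's `copy.deepcopy` to clone the paused state, including the random number generator, so that both branches see the same random draws.

```python
import copy
import random


class ToySim:
    """Hypothetical toy simulator; a stand-in for the real game engine."""

    def __init__(self, seed: int, tax_rate: float):
        self.rng = random.Random(seed)
        self.tax_rate = tax_rate  # illustrative policy knob
        self.gold = 0.0
        self.turn = 0

    def step(self) -> None:
        income = self.rng.uniform(5, 15)
        self.gold += income * self.tax_rate
        self.turn += 1

    def run(self, turns: int) -> "ToySim":
        for _ in range(turns):
            self.step()
        return self


def fork(sim: ToySim) -> ToySim:
    # Deep copy clones the full state, RNG included, leaving the original untouched.
    return copy.deepcopy(sim)


# Pause at turn 20, then branch into a factual and a counterfactual future.
base = ToySim(seed=7, tax_rate=0.3).run(20)

factual = fork(base).run(30)

counterfactual = fork(base)
counterfactual.tax_rate = 0.5  # the alternate policy choice
counterfactual.run(30)

policy_effect = counterfactual.gold - factual.gold
```

Because both branches start from the same copied state and draw the same random incomes, the gap between them in this toy setup comes from the policy change alone, which is the kind of controlled comparison real-world forecasting data cannot offer.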
The benchmark is positioned as a complement to real-world forecasting tests, not a replacement, and that caveat matters. Freeciv is a tidy, rule-governed world; geopolitical forecasting is not. A model that dominates Freeciv rollouts still has to prove it can reason under the genuine ambiguity of the messy, unstructured world in which it will actually be deployed.