A Civilization Game Is Now an AI Forecasting Benchmark

ForecastBench-Sim uses Freeciv rollouts to test AI probabilistic reasoning without waiting years for real-world outcomes to resolve.

Researchers have built a forecasting benchmark for AI systems out of a strategy game, and it might ease one of the field's most persistent problems: the long wait for real-world outcomes to resolve.

ForecastBench-Sim uses Freeciv, an open-source turn-based strategy game modeled on the Civilization series, to generate forecasting questions from live game states. A model receives a structured snapshot of the current game world, answers questions about what will happen next, and then the simulation runs forward to score those predictions. Because it is a simulation, questions can target any time horizon, cover rare or catastrophic events, and support counterfactual setups — things like "what would have happened if this civilization had chosen a different policy." The benchmark includes both binary and continuous question types, and the full pipeline, question families, and scoring protocol are being released publicly.
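The snapshot, forecast, rollout, and score loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's released pipeline: `ToyWorld` is a hypothetical stand-in for a game state (it does not use the Freeciv API), and the scoring rules shown here, the Brier score for binary questions and absolute error for continuous ones, are common choices assumed for the example rather than the protocol the researchers actually use.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class ToyWorld:
    """Hypothetical stand-in for a game snapshot; NOT the Freeciv API."""
    turn: int
    cities: int
    rng: random.Random

    def step(self) -> None:
        # Toy dynamics (an assumption): each turn the civilization
        # founds one new city with probability 0.3.
        if self.rng.random() < 0.3:
            self.cities += 1
        self.turn += 1


def rollout(snapshot: ToyWorld, horizon: int) -> ToyWorld:
    """Fork the snapshot and run the fork forward `horizon` turns."""
    fork = copy.deepcopy(snapshot)  # the original snapshot is left untouched
    for _ in range(horizon):
        fork.step()
    return fork


def brier_score(prob: float, outcome: bool) -> float:
    """Squared error of a probability forecast (0 is best, 1 is worst)."""
    return (prob - float(outcome)) ** 2


def absolute_error(prediction: float, actual: float) -> float:
    """Distance between a point forecast and the realized value."""
    return abs(prediction - actual)


def score_forecasts(snapshot: ToyWorld, horizon: int, p_binary: float,
                    x_continuous: float, threshold: int) -> dict:
    """Resolve one binary and one continuous question by simulation.

    Binary question: "Will the civilization have at least `threshold`
    cities after `horizon` turns?" Continuous question: "How many
    cities will it have?"
    """
    final = rollout(snapshot, horizon)
    happened = final.cities >= threshold
    return {
        "binary_brier": brier_score(p_binary, happened),
        "continuous_abs_error": absolute_error(x_continuous, final.cities),
        "final_cities": final.cities,
    }


if __name__ == "__main__":
    snapshot = ToyWorld(turn=10, cities=3, rng=random.Random(0))
    # The forecasts (0.7 and 9.0) would come from the model under test.
    print(score_forecasts(snapshot, horizon=20, p_binary=0.7,
                          x_continuous=9.0, threshold=8))
```

Because `rollout` works on a deep copy, the same snapshot can be resolved at several horizons, or modified before the rollout to pose a counterfactual question, without disturbing the original state.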

Existing forecasting benchmarks inherit real-world constraints: outcomes take months or years to resolve, tail events almost never appear in training data, and it is nearly impossible to run controlled experiments with alternate histories. ForecastBench-Sim sidesteps all three problems by treating the game engine as a kind of on-demand reality that can be paused, forked, and re-run. The researchers also ran a human pilot alongside the model evaluations, giving at least a baseline for comparing AI to people.

The benchmark is positioned as a complement to real-world forecasting tests, not a replacement, and that caveat matters. Freeciv is a tidy, rule-governed world; real-world geopolitics is not. A model that dominates Freeciv rollouts still has to prove it can reason under the genuine ambiguity of the messy, unstructured world it will actually be deployed in.

The Revision

Written by an AI system from the public sources credited above.