
LLM-WikiRace benchmark shows planning gaps in top LLMs

A new Wikipedia-link navigation test reveals that even GPT-5, Gemini-3 and Claude Opus 4.5 stumble on complex planning, succeeding in fewer than 25% of hard cases.

A new benchmark forces LLMs to hop Wikipedia links toward a target page, exposing planning weaknesses.

LLM-WikiRace presents a source article and a goal page; models must choose hyperlinks step by step to reach the goal. The test has an easy tier and a hard tier that requires longer look-ahead. Open- and closed-source models, including Gemini-3, GPT-5 and Claude Opus 4.5, reach near-human or superhuman scores on the easy tier. On the hard tier, the best model, Gemini-3, succeeded in only 23% of games. Analysis shows that world knowledge helps up to a point, but beyond that, planning and long-horizon reasoning dominate performance. Even top models often loop back to pages they have already visited after a misstep instead of replanning.
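To make the setup concrete, here is a minimal sketch of a WikiRace-style game loop on a toy link graph. It is not the benchmark's actual harness: the `LINKS` graph, the `play` and `shortest_hops` helpers, the step budget, and both policies are hypothetical names invented for illustration. In a real run, the `choose` callback would be a call to the model under test, and the graph would come from Wikipedia's link structure.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

# Hypothetical toy link graph: page title -> pages it links to.
LINKS: Dict[str, List[str]] = {
    "Coffee": ["Ethiopia", "Caffeine"],
    "Caffeine": ["Coffee", "Chemistry"],
    "Ethiopia": ["Africa", "Coffee"],
    "Africa": ["Ethiopia", "Nile"],
    "Chemistry": ["Caffeine"],
    "Nile": ["Africa"],
}

Policy = Callable[[List[str], List[str], str], str]


def shortest_hops(source: str, goal: str) -> Optional[int]:
    """Breadth-first search for the minimum number of clicks (None if unreachable)."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == goal:
            return dist
        for nxt in LINKS.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def play(source: str, goal: str, choose: Policy, max_steps: int = 10) -> dict:
    """Run one game; `choose` picks the next link from the current page."""
    path = [source]
    revisits = 0  # counts moves onto already-visited pages (looping behaviour)
    while path[-1] != goal and len(path) - 1 < max_steps:
        options = LINKS.get(path[-1], [])
        if not options:
            break  # dead end
        nxt = choose(path, options, goal)
        if nxt not in options:
            break  # illegal move ends the game
        if nxt in path:
            revisits += 1
        path.append(nxt)
    return {"success": path[-1] == goal, "path": path, "revisits": revisits}


def first_link_policy(path: List[str], options: List[str], goal: str) -> str:
    """Naive baseline: always click the first link, with no memory of the path."""
    return options[0]


def avoid_revisits_policy(path: List[str], options: List[str], goal: str) -> str:
    """Take the goal if it is linked; otherwise prefer pages not yet visited."""
    if goal in options:
        return goal
    fresh = [page for page in options if page not in path]
    return (fresh or options)[0]


naive = play("Coffee", "Nile", first_link_policy, max_steps=6)
careful = play("Coffee", "Nile", avoid_revisits_policy, max_steps=6)
```

On this toy graph the memoryless policy bounces between "Ethiopia" and "Africa" until the step budget runs out, while the policy that tracks its own path reaches "Nile" in the minimum three clicks. The `revisits` counter is one simple way to quantify the looping behaviour described above; how the benchmark itself measures it is not stated here.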

The result matters because many headlines now tout LLMs as “reasoning agents.” This benchmark strips away prompt‑engineering tricks and forces the model to chart a path through real‑world knowledge structures. The sharp drop from easy to hard suggests current systems are still far from autonomous planners, a gap that could affect applications like automated research assistance or multi‑step decision support.

In short, LLM‑WikiRace provides a low‑cost, transparent arena where planning capability is the bottleneck, reminding us that raw language fluency does not equal functional reasoning.


The Revision

Written by an AI system from the public sources credited above.