A new benchmark forces LLMs to hop Wikipedia links toward a target page, exposing planning weaknesses.
LLM-WikiRace presents a source article and a goal page; models must choose hyperlinks step by step to reach the goal. The test has an easy tier and a hard tier that requires longer look-ahead. Open- and closed-source models, including Gemini-3, GPT-5 and Claude Opus 4.5, hit near-human or superhuman scores on the easy tier. On the hard tier, the best model, Gemini-3, succeeded in only 23% of games. Analysis shows that world knowledge helps up to a point, but beyond that, planning and long-horizon reasoning dominate performance. Even top models often loop back on themselves after a misstep instead of replanning.
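To make the task format concrete, here is a minimal Python sketch of a link-hopping game on a toy graph. It is an illustration under stated assumptions, not the benchmark's actual harness: the `LINKS` graph, the `play` loop, the step budget, and the two stand-in policies (`first_link`, `avoid_visited`) are all hypothetical, and a real run would call a model where the `choose` function sits. The `revisits` count is one simple way to quantify the looping behavior described above.

```python
from collections import deque

# Hypothetical toy link graph; the real benchmark navigates Wikipedia pages.
LINKS = {
    "Coffee": ["Caffeine", "Ethiopia"],
    "Caffeine": ["Coffee", "Chemistry"],
    "Ethiopia": ["Africa", "Coffee"],
    "Chemistry": ["Physics"],
    "Africa": ["Ethiopia"],
    "Physics": [],
}


def shortest_hops(links, source, goal):
    """Breadth-first search: minimum number of clicks from source to goal, or None."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, hops = queue.popleft()
        if page == goal:
            return hops
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


def play(links, source, goal, choose, max_steps=10):
    """Run one game; `choose(page, options, goal, path)` stands in for the model."""
    path = [source]
    while path[-1] != goal and len(path) - 1 < max_steps:
        options = links.get(path[-1], [])
        if not options:  # dead end: no outgoing links
            break
        path.append(choose(path[-1], options, goal, path))
    # Pages entered more than once are a crude signal of looping.
    revisits = len(path) - len(set(path))
    return {"success": path[-1] == goal, "path": path, "revisits": revisits}


def first_link(page, options, goal, path):
    """Naive policy: always click the first link, even if already visited."""
    return options[0]


def avoid_visited(page, options, goal, path):
    """Slightly better policy: prefer links not yet on the path."""
    fresh = [o for o in options if o not in path]
    return (fresh or options)[0]


if __name__ == "__main__":
    print(shortest_hops(LINKS, "Coffee", "Physics"))
    print(play(LINKS, "Coffee", "Physics", first_link))
    print(play(LINKS, "Coffee", "Physics", avoid_visited))
```

On this toy graph the naive policy bounces between two pages until the step budget runs out, while the policy that avoids visited pages reaches the goal in the minimum number of hops. Comparing a policy's path length with `shortest_hops` gives a rough efficiency measure, again as an assumption about how such a game could be scored rather than a description of the benchmark's metric.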
The result matters because many headlines now tout LLMs as “reasoning agents.” This benchmark strips away prompt-engineering tricks and forces the model to chart a path through real-world knowledge structures. The sharp drop from easy to hard suggests current systems are still far from autonomous planners, a gap that could affect applications like automated research assistance or multi-step decision support.
In short, LLM-WikiRace provides a low-cost, transparent arena where planning capability is the bottleneck, reminding us that raw language fluency does not equal functional reasoning.