[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"branding":3,"analytics":7,"article-llm-wikirace-benchmark-shows-planning-gaps-in-top-llms":10},{"siteName":4,"siteTagline":5,"publisherName":4,"contactEmail":6},"The Revision","Tech news, decoded.","editor@therevision.news",{"gaMeasurementId":8,"adsenseClientId":9},"G-ZW2MV82GYR","ca-pub-8533917693782264",{"article":11},{"id":12,"slug":13,"title":14,"dek":15,"body_md":16,"tags_json":17,"published_at":18,"created_at":19,"updated_at":20,"status":21,"review_note":22,"review_notes":23,"image_url":22,"persona_id":22,"persona_name":22,"section":22,"tags":24,"sources":28,"feedback":32,"feedback_at":22,"cost_usd":32,"total_tokens":32},1271,"llm-wikirace-benchmark-shows-planning-gaps-in-top-llms","LLM-WikiRace benchmark shows planning gaps in top LLMs","A new Wikipedia-link navigation test reveals that even GPT-5, Gemini-3 and Claude Opus 4.5 stumble on complex planning, succeeding in fewer than 25% of hard cases.","A new benchmark forces LLMs to hop Wikipedia links toward a target page, exposing planning weaknesses.\n\nLLM-WikiRace presents a source article and a goal page; models must choose hyperlinks step‑by‑step to reach the goal. The test has an easy tier and a hard tier that requires longer look‑ahead. Open‑ and closed‑source models, including Gemini-3, GPT-5 and Claude Opus 4.5, hit near‑human or superhuman scores on the easy tier. On the hard tier, the best model, Gemini-3, succeeded in only 23% of games. The analysis shows that world knowledge helps up to a point, but beyond that, planning and long‑horizon reasoning dominate performance. Even top models often loop back on themselves after a misstep instead of replanning.\n\nThe result matters because many headlines now tout LLMs as “reasoning agents.” This benchmark strips away prompt‑engineering tricks and forces the model to chart a path through real‑world knowledge structures. 
The sharp drop from the easy tier to the hard tier suggests that current systems are still far from autonomous planners, a gap that could affect applications such as automated research assistance or multi‑step decision support.\n\nIn short, LLM-WikiRace provides a low‑cost, transparent arena where planning capability is the bottleneck, reminding us that raw language fluency does not equal functional reasoning.","[\"large-language-models\",\"benchmarks\",\"planning\"]","2026-06-16T04:00:00.000Z","2026-06-17T01:02:35.054Z","2026-06-17T01:02:37.856Z","published",null,[],[25,26,27],"large-language-models","benchmarks","planning",[29],{"name":30,"url":31},"arXiv cs.AI","https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.16902",0]