New Benchmark Shows LLM Agents Struggle With Real Operations Research

ORAgentBench tests 14 AI agent configurations on end-to-end logistics and planning tasks. The best one clears just over 35% of them.

A new benchmark called ORAgentBench finds that today's best AI agents are nowhere near ready to handle real operations research work autonomously.

Researchers built ORAgentBench to test whether large language model agents can manage the full workflow of an OR task: reading a natural-language brief, ingesting multi-file data, writing and executing solution code, and producing a submission that passes hidden validators. The benchmark includes 107 human-reviewed tasks across varied operational scenarios, each run in an isolated environment with strict schema and feasibility requirements. Fourteen frontier agent-model configurations were tested. The top performer passed 35.51% of all tasks and only 20.59% of the hardest ones. Many submissions that were technically feasible still fell below the required quality threshold.
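To make the gap between "feasible" and "passing" concrete, here is a minimal Python sketch of a three-gate validator: schema, then feasibility, then a quality threshold. It is an illustration only, not ORAgentBench's actual validator. The toy truck-loading task, the `validate` function, the `Verdict` type, and every parameter name are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Outcome of the three hypothetical gates, checked in order."""
    schema_ok: bool
    feasible: bool
    meets_quality: bool

    @property
    def passed(self) -> bool:
        return self.schema_ok and self.feasible and self.meets_quality


def validate(submission, demands, capacity, n_trucks, max_trucks_used):
    """Check a shipment-to-truck assignment (toy task, not from the benchmark)."""
    # Gate 1 (schema): every shipment appears exactly once and maps to a valid truck index.
    schema_ok = (
        isinstance(submission, dict)
        and set(submission) == set(demands)
        and all(type(t) is int and 0 <= t < n_trucks for t in submission.values())
    )
    if not schema_ok:
        return Verdict(False, False, False)

    # Gate 2 (feasibility): no truck may carry more than its capacity.
    load = {}
    for shipment, truck in submission.items():
        load[truck] = load.get(truck, 0) + demands[shipment]
    if any(total > capacity for total in load.values()):
        return Verdict(True, False, False)

    # Gate 3 (quality): a feasible plan must also use few enough trucks.
    return Verdict(True, True, len(load) <= max_trucks_used)


# Hypothetical instance: four shipments, trucks of capacity 6, at most 2 trucks allowed.
demands = {"a": 4, "b": 3, "c": 3, "d": 2}
good = validate({"a": 0, "d": 0, "b": 1, "c": 1}, demands, 6, 4, 2)
wasteful = validate({"a": 0, "b": 1, "c": 2, "d": 3}, demands, 6, 4, 2)
overloaded = validate({"a": 0, "b": 0, "c": 1, "d": 1}, demands, 6, 4, 2)
```

In this sketch the `wasteful` plan is well-formed and violates no capacity limit, yet it still fails because it uses four trucks where two suffice. That is the kind of outcome the article describes as feasible but below the quality threshold.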

Most AI benchmarks for optimization decouple the modeling step from the solving step, or hand agents pre-formalized problem definitions. Both conditions flatter the models and obscure where they actually break down. ORAgentBench closes that gap, which is why the failure numbers look worse here than in prior work: agents are being asked to do the whole job, not just the easy part of it.

The failure analysis points to strategic weaknesses rather than simple coding errors: missed operational constraints, brittle problem formulations, and poor solution improvement loops. That suggests adding more OR-specific prompting tricks is unlikely to close the gap on its own.

The Revision

Written by an AI system from the public sources credited above.