New Benchmark Shows LLM Agents Struggle With Real Operations Research

A new benchmark called ORAgentBench finds that today's best AI agents are nowhere near ready to handle real operations research work autonomously.

Researchers built ORAgentBench to test whether large language model agents can manage the full workflow of an OR task: reading a natural-language brief, ingesting multi-file data, writing and executing solution code, and producing a submission that passes hidden validators. The benchmark includes 107 human-reviewed tasks across varied operational scenarios, each in an isolated environment with strict schema and feasibility requirements. Fourteen frontier agent-model configurations were tested. The top performer passed 35.51% of all tasks and only 20.59% of the hardest ones — and many technically feasible submissions still fell below the required quality threshold.
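
To make those hurdles concrete, here is a minimal Python sketch of the kind of layered check such a validator might perform, using a toy knapsack-style task. It is not ORAgentBench's actual validator. The `validate` function, the `Verdict` type, the `chosen` field, and the 95% quality ratio are all hypothetical, chosen only to illustrate how a submission can be well-formed and feasible yet still fail on quality.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    schema_ok: bool
    feasible: bool
    meets_quality: bool

    @property
    def passed(self) -> bool:
        # A task counts as passed only if every layer succeeds.
        return self.schema_ok and self.feasible and self.meets_quality


def validate(submission, capacity, weights, values, reference_value, min_ratio=0.95):
    """Hypothetical validator for a toy knapsack task (illustrative assumption only)."""
    # 1. Schema: expect {"chosen": [distinct item indices]}.
    chosen = submission.get("chosen") if isinstance(submission, dict) else None
    schema_ok = (
        isinstance(chosen, list)
        and all(
            isinstance(i, int) and not isinstance(i, bool) and 0 <= i < len(weights)
            for i in chosen
        )
        and len(set(chosen)) == len(chosen)
    )
    if not schema_ok:
        return Verdict(False, False, False)

    # 2. Feasibility: total weight must respect the capacity constraint.
    if sum(weights[i] for i in chosen) > capacity:
        return Verdict(True, False, False)

    # 3. Quality: objective must reach a fraction of a reference solution's value.
    value = sum(values[i] for i in chosen)
    return Verdict(True, True, value >= min_ratio * reference_value)


# Toy instance: the reference solution picks items 0 and 2 (weight 8, value 90).
weights, values, capacity, reference_value = [3, 4, 5], [30, 50, 60], 8, 90

feasible_but_weak = validate({"chosen": [0, 1]}, capacity, weights, values, reference_value)
print(feasible_but_weak.feasible, feasible_but_weak.passed)  # True False
```

In this sketch the submission `[0, 1]` fits within capacity but reaches a value of only 80 against a threshold of 85.5, so it is feasible and still fails, which mirrors the article's point that feasibility alone does not earn a pass.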

Most AI benchmarks for optimization decouple the modeling step from the solving step, or hand agents pre-formalized problem definitions. Those conditions flatter the models and obscure where they actually break down. ORAgentBench closes that gap, which is why the failure numbers look worse here than in prior work: agents are asked to do the whole job, not just the easy part of it.

The failure analysis points to strategic weaknesses rather than simple coding errors: missed operational constraints, brittle problem formulations, and poor solution improvement loops. That suggests more OR-specific prompting tricks alone are unlikely to close the gap.
