CEO-Bench Tests Whether AI Agents Can Run a Company

Most AI benchmarks ask models to sprint. CEO-Bench asks them to run a marathon.

Researchers introduced CEO-Bench, a benchmark that drops language model agents into a simulated startup and tells them to keep it alive for 500 days. The agent manages pricing, marketing, budgeting, and other operational decisions through a Python interface, working from the same kind of noisy, interconnected data that a real executive would face. The bar for success is modest: finish with more than the $1M starting balance. Most models failed to clear even that. Only Claude Opus 4.8 and GPT-5.5 did, and neither managed to turn a consistent profit.

The result is a useful corrective to recent AI hype. Vendors have leaned hard on benchmark scores for coding and customer service — tasks that are isolated, short, and well-defined. CEO-Bench probes four capabilities those evals ignore: handling long time horizons under uncertainty, extracting signal from noisy data, adapting to changing conditions, and coordinating multiple decisions toward a single goal. Those are precisely the things that make real work hard, and the gap between "good at sprints" and "good at marathons" turns out to be wide.

The strongest models wrote code to simulate customer cohorts and mine negotiation histories — creative workarounds, not just instruction-following. That hints at where the ceiling is: agents can improvise tactically but still struggle to sustain a coherent strategy across hundreds of decision points.

For a field that routinely announces agents are ready to replace knowledge workers, a benchmark where the best models can barely break even running a simulated startup is a telling data point.
