A new paper describes AgentBeats, a system in which AI agents are evaluated by other AI agents acting as judges through a shared, standardized interface.
The authors argue that current benchmarks require two separate interfaces (one for the test suite and another for the agent under test), which makes setup cumbersome and limits cross-agent comparison. Their approach, Agentified Agent Assessment (AAA), replaces these with two standardized protocols: A2A (Agent2Agent) for task management and MCP (Model Context Protocol) for tool access. They report a five-month open competition that attracted 298 judge agents and 467 subject agents across 12 categories, plus a controlled study on coding agents that preserved the fidelity of results while producing new head-to-head comparison data.
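As a rough illustration of the judge-and-subject arrangement described above, the following Python sketch shows one judge scoring two interchangeable subject agents through the same interface. It does not use the real A2A or MCP libraries. `Task`, `SubjectAgent`, `JudgeAgent`, `handle_task`, and the `add` tool are hypothetical names invented for this example: a plain method call stands in for A2A task messages, and a dictionary of functions stands in for MCP tool access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol

# Hypothetical stand-in for MCP tool access: tool name -> callable.
Tools = Dict[str, Callable[[str], str]]


@dataclass
class Task:
    """Hypothetical task message; not the real A2A schema."""
    task_id: str
    prompt: str


class SubjectAgent(Protocol):
    """Anything with a name and a handle_task method can be assessed."""
    name: str

    def handle_task(self, task: Task, tools: Tools) -> str: ...


@dataclass
class EchoAgent:
    """Baseline subject that ignores the tools and repeats the prompt."""
    name: str

    def handle_task(self, task: Task, tools: Tools) -> str:
        return task.prompt


@dataclass
class ToolUsingAgent:
    """Subject that delegates the prompt to the shared 'add' tool."""
    name: str

    def handle_task(self, task: Task, tools: Tools) -> str:
        return tools["add"](task.prompt)


@dataclass
class JudgeAgent:
    """Judge that owns the tasks, the expected answers, and the tools."""
    tasks: List[Task]
    expected: Dict[str, str]
    tools: Tools

    def assess(self, subject: SubjectAgent) -> float:
        # Send every task through the same interface and count exact matches.
        correct = sum(
            subject.handle_task(task, self.tools).strip() == self.expected[task.task_id]
            for task in self.tasks
        )
        return correct / len(self.tasks)


def add_tool(expression: str) -> str:
    """Toy tool: sums integers written as '2+3'."""
    return str(sum(int(part) for part in expression.split("+")))


judge = JudgeAgent(
    tasks=[Task("t1", "2+3"), Task("t2", "10+5")],
    expected={"t1": "5", "t2": "15"},
    tools={"add": add_tool},
)
subjects = [EchoAgent("echo"), ToolUsingAgent("tool-user")]
scores = {subject.name: judge.assess(subject) for subject in subjects}
print(scores)
```

Because the judge depends only on `handle_task`, a new subject can be added without changing the judge. That is the property the summary attributes to standardized protocols, though a real deployment would exchange messages over the network rather than through in-process calls.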
If the approach works, it could level the playing field for diverse agent architectures, reduce the engineering overhead of benchmarking, and make results more reproducible. Researchers would no longer need custom harnesses for each new agent, and the community could more easily compare open‑source, proprietary, and multimodal systems.
The effort is still early, and its success will depend on whether major labs adopt the protocols rather than keep their own closed suites.