LLM‑powered agents are now being watched like genomes.
Researchers encoded 347 real‑world runs of a ReAct‑style agent into four symbols—Explore (X), Execute (E), Plan (P) and Verify (V). Mining n‑grams, they flagged the trigram P‑X‑P as a statistically significant failure pattern, shaving 10.4 % off success rates. A low P‑ratio also predicted poorer outcomes (r = ‑0.256). To counter this, they built Governor, a three‑layer runtime guard that injects rules, aggregates statistics and adapts thresholds via chi‑square tests. In a before‑after test (101 vs. 246 runs), success rose 6.2 % points while token consumption dropped 44 %.
The take is that symbolic trace analysis can expose hidden failure modes that pure reward signals miss, offering a cheap, model‑agnostic safety layer. It also suggests that other agents may harbor similar “verification deficits,” a hypothesis the authors confirmed on 2,000 SWE‑bench runs.
If the approach scales, we might see a new class of lightweight overseers rather than ever‑larger models chasing the same goal.