LLMs now report confidence scores, but the scale they use matters.
Researchers tested six models on three datasets, varying how confidence was encoded. Across all setups, the models concentrated their ratings on three round numbers (typically 0, 50 and 100), leaving the rest of the scale unused. The team reshaped the scale to 0‑20, tightened or loosened its boundaries, and made the range irregular, measuring metacognitive sensitivity under each condition with meta‑d'. The 0‑20 scale consistently raised meta‑d' scores, indicating sharper self‑knowledge. Tightening the scale’s boundaries hurt performance, and the preference for round numbers persisted even when the range was irregular.
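To make the two measurements concrete, here is a minimal Python sketch, not the study's own code. It computes how much of the rating mass sits on the three most frequent values, and a type‑2 AUROC as a simpler stand‑in for meta‑d' (a proper meta‑d' estimate requires fitting a signal detection model, which is not attempted here). The function names and the ratings are made up for illustration.

```python
from collections import Counter


def top_k_share(ratings, k=3):
    """Fraction of ratings that fall on the k most frequent values."""
    if not ratings:
        raise ValueError("ratings must be non-empty")
    counts = Counter(ratings)
    return sum(n for _, n in counts.most_common(k)) / len(ratings)


def type2_auroc(ratings, correct):
    """Probability that a correct answer received higher confidence than an
    incorrect one, counting ties as one half. This is NOT meta-d'; it is a
    simpler, related measure of how well confidence tracks correctness."""
    hits = [r for r, c in zip(ratings, correct) if c]
    misses = [r for r, c in zip(ratings, correct) if not c]
    if not hits or not misses:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((h > m) + 0.5 * (h == m) for h in hits for m in misses)
    return wins / (len(hits) * len(misses))


# Hypothetical ratings on a 0-100 scale, invented for this example.
ratings = [100, 100, 50, 0, 100, 50, 85, 100, 50, 0]
correct = [True, True, False, False, True, True, True, False, False, False]

share = top_k_share(ratings)
auroc = type2_auroc(ratings, correct)
print(f"top-3 share: {share:.2f}")   # top-3 share: 0.90
print(f"type-2 AUROC: {auroc:.2f}")  # type-2 AUROC: 0.82
```

A top‑3 share near 1 means almost all ratings land on three values. An AUROC of 0.5 means confidence carries no information about correctness, and 1.0 means every correct answer was rated above every incorrect one.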
The takeaway is practical: confidence scales are not a neutral overlay. A shorter scale such as 0‑20 can extract more nuanced uncertainty signals from LLMs, which matters for any downstream task that relies on model confidence, such as risk assessment, active learning, or human‑in‑the‑loop pipelines.
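One way to act on this in a pipeline is to make the scale an explicit, configurable object rather than a constant buried in a prompt. The sketch below is a hypothetical helper, not something taken from the study: `ConfidenceScale`, its prompt wording, and its parsing rule are all assumptions. It assumes the model's reply ends with a bare integer rating.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceScale:
    """Hypothetical helper: a confidence scale with inclusive integer bounds."""
    low: int
    high: int

    def __post_init__(self):
        if self.high <= self.low:
            raise ValueError("high must be greater than low")

    def instruction(self):
        # Assumed prompt wording; adjust to whatever the experiment uses.
        return (f"After your answer, rate your confidence as an integer "
                f"from {self.low} to {self.high}.")

    def parse(self, reply):
        """Return the last in-range integer in the reply, rescaled to [0, 1],
        or None if the reply contains no in-range integer."""
        for token in reversed(re.findall(r"-?\d+", reply)):
            value = int(token)
            if self.low <= value <= self.high:
                return (value - self.low) / (self.high - self.low)
        return None


# Ratings from different scales become comparable after rescaling.
short_scale = ConfidenceScale(0, 20)
long_scale = ConfidenceScale(0, 100)
print(short_scale.instruction())
print(short_scale.parse("Paris. Confidence: 15"))  # 0.75
print(long_scale.parse("Paris. Confidence: 75"))   # 0.75
```

Rescaling to [0, 1] lets downstream code stay unchanged while the scale itself is varied between runs.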
In short, the study shows that a simple change to the confidence scale can improve measured metacognitive sensitivity in LLMs, and it urges researchers to treat confidence scales as experimental parameters rather than an afterthought.