- A new computational framework called DEFINED can assess fine‑grained creativity in debate scenarios.
DEFINED defines a hierarchical eight‑dimensional metric and fine‑tunes a pretrained autoregressive language model fitted with a dedicated scoring head. The authors collected statements and expert scores from real debate competitions, then applied constrained data augmentation to reduce the elite bias in the source data. A mixed‑granularity training regime lets the model learn from only a small set of graduate‑expert annotations.
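To make the "scoring head" idea concrete, here is a minimal sketch, not the authors' implementation. It assumes the head is a single linear layer that maps one pooled hidden vector from the language model to eight scores, each squashed into a 0 to 10 range. The function names, the hidden size of 16, the score range, and the random initialisation are all hypothetical; the summary specifies none of them.

```python
import math
import random

NUM_DIMENSIONS = 8  # the summary states an eight-dimensional metric


def make_scoring_head(hidden_size, num_dims=NUM_DIMENSIONS, seed=0):
    """Randomly initialise a linear head: one weight row and one bias per dimension."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(hidden_size)
    weights = [[rng.uniform(-scale, scale) for _ in range(hidden_size)]
               for _ in range(num_dims)]
    biases = [0.0] * num_dims
    return weights, biases


def score_statement(hidden_state, head, max_score=10.0):
    """Map one pooled hidden vector to one score per dimension in (0, max_score)."""
    weights, biases = head
    scores = []
    for row, bias in zip(weights, biases):
        logit = sum(w * h for w, h in zip(row, hidden_state)) + bias
        # Sigmoid squashing keeps every score inside the assumed range.
        scores.append(max_score / (1.0 + math.exp(-logit)))
    return scores


# Stand-in for the final-token hidden state of a pretrained language model
# (an assumption: a real system would take this vector from the model itself).
hidden = [0.1 * i for i in range(16)]
head = make_scoring_head(hidden_size=16)
scores = score_statement(hidden, head)
```

In a trained system the weights would be learned from the expert scores rather than drawn at random, and the eight outputs could then be grouped according to the metric's hierarchy.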
The system outperforms both prompt‑based large language model evaluators and existing debate‑scoring methods on a suite of tests, including a study with participants who had never debated before. This suggests the model can generalize beyond the elite data it was trained on and could serve as a cheaper alternative to human judges.
If the field can reliably automate nuanced creativity metrics, researchers may finally move beyond the simplified tasks that dominate current LLM evaluation. Still, the approach hinges on a narrow set of debate data, so broader claims about creative ability across domains remain premature.