- Researchers tested four instruction-tuned LLMs on token‑level Correct Information Unit (CIU) labeling of aphasic speech.
- On sixteen picture‑description transcripts spanning four severity levels, zero‑shot prompting failed. Few‑shot prompting improved performance, with Llama‑3.1‑8B, Qwen2.5‑7B and Mistral‑7B reaching mean F1 scores between 0.776 and 0.817. Precision stayed lower than recall, indicating that the models over‑tagged tokens as CIUs, and results deteriorated as aphasia severity increased.
- The work matters because CIU scoring currently requires trained clinicians and is time‑consuming. Automating the task could free up clinical hours and speed up assessments, though the current error profile means the models are best used as assistants rather than replacements.
- In short, few‑shot LLM prompting can identify CIUs at a level comparable to human annotators, yet the precision gap and the drop in performance at higher severity levels rule out fully autonomous deployment. The study points to a human‑in‑the‑loop workflow as the nearest practical application.
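
To make the precision/recall pattern concrete, here is a minimal sketch of token‑level scoring, assuming each token carries a binary label (1 = CIU, 0 = not a CIU). The function name `token_prf` and the example labels are illustrative inventions, not the study's code or data.

```python
from typing import Sequence, Tuple


def token_prf(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for binary token labels (1 = CIU)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")

    # Count true positives, false positives, and false negatives per token.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

    # Guard against empty denominators (no predicted or no gold CIUs).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical labels: the model tags every gold CIU plus two extra tokens.
gold = [1, 1, 0, 0, 1, 0, 1, 0]
pred = [1, 1, 1, 0, 1, 1, 1, 0]

precision, recall, f1 = token_prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=1.000 f1=0.800
```

In this made‑up example the over‑tagging leaves recall untouched but pulls precision down, the same direction of error the summary describes: every extra token tagged as a CIU is a false positive that a human reviewer would need to remove.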