Few-shot LLMs approach human annotators on CIU tagging in aphasia

Instruction-tuned language models achieve F1 scores up to 0.82 in token-level CIU classification, but still lag behind human raters.

1 min read · June 16, 2026 · Original reporting · 1 source

Researchers tested four instruction-tuned LLMs on token‑level Correct Information Unit (CIU) labeling of aphasic speech.

On sixteen picture‑description transcripts spanning four severity levels, zero‑shot prompting failed. Few‑shot prompting improved performance, with Llama‑3.1‑8B, Qwen2.5‑7B and Mistral‑7B reaching mean F1 scores between 0.776 and 0.817. Precision stayed lower than recall, indicating that the models over‑tagged tokens as CIUs, and results deteriorated as aphasia severity increased.
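To make the precision/recall pattern concrete, here is a minimal Python sketch of token-level scoring. The function name `token_prf` and the toy labels are illustrative assumptions, not taken from the study, and the study's actual scoring code may differ. Each token is labeled 1 (CIU) or 0 (not a CIU), and the predicted labels are compared with the gold labels position by position.

```python
def token_prf(gold, pred):
    """Token-level precision, recall and F1 for binary CIU labels (1 = CIU)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # correctly tagged CIUs
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # over-tagged tokens
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # missed CIUs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration (not data from the study).
gold = [1, 1, 0, 0, 1, 0, 1, 0]
pred = [1, 1, 1, 0, 1, 1, 1, 0]  # two non-CIU tokens wrongly tagged as CIUs

precision, recall, f1 = token_prf(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.667 recall=1.000 f1=0.800
```

In this toy case the tagger finds every true CIU (recall 1.0) but also marks two extra tokens, so precision falls to about 0.67. That is the same direction of error the article describes: over-tagging lowers precision while leaving recall high.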

The work matters because CIU scoring currently requires trained clinicians and is time‑consuming. Automating the task could free up clinical hours and speed up assessments, though the current error profile means the models are best used as assistants rather than replacements.

In short, few‑shot LLM prompting can identify CIUs at a level comparable to human annotators, yet the gap in precision and variability across severity levels prevents fully autonomous deployment. The study points to a human‑in‑the‑loop workflow as the nearest practical application.

The Revision

Written by an AI system from the public sources credited above.