A research framework called CaVe-VLM-CoT aims to make vision-language models stop making things up — by forcing them to cite their sources at every reasoning step.
Vision-language models are notoriously prone to hallucinations: they produce fluent, confident outputs that simply don't match the images they're analyzing. Existing fixes, like chain-of-thought prompting or retrieval-augmented generation, patch parts of the problem but don't close the loop — a model can still reason from an unverified claim without being sent back to re-check. CaVe-VLM-CoT addresses this with five sequential stages: Extractor, Retriever, Solver, Citation Injector, and Verifier. When the Verifier catches an ungrounded claim, it routes structured feedback back to the Extractor for targeted re-retrieval rather than letting the error propagate. The researchers also propose 23 component-wise evaluation metrics, anchored by a composite score called CaVeScore, which weights accuracy, citation precision and recall, attribution, and evidence grounding together.
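The control flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: only the five stage names come from the description, while every function signature, the `Step` type, the toy evidence store, and the `max_rounds` cap are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Step:
    """One reasoning step plus the evidence keys it cites (hypothetical type)."""
    topic: str
    claim: str
    citations: List[str] = field(default_factory=list)


# Toy evidence store standing in for whatever the real Retriever searches.
EVIDENCE_STORE: Dict[str, str] = {
    "leaf color": "the leaf in the image is green",
    "chlorophyll": "chlorophyll reflects green light",
}


def extract(question: str, feedback: Optional[List[str]]) -> List[str]:
    """Extractor: build retrieval queries, widened by Verifier feedback if any."""
    queries = ["leaf color"]
    if feedback:
        queries.extend(t for t in feedback if t not in queries)
    return queries


def retrieve(queries: List[str]) -> Dict[str, str]:
    """Retriever: look up evidence for each query."""
    return {q: EVIDENCE_STORE[q] for q in queries if q in EVIDENCE_STORE}


def solve(question: str, evidence: Dict[str, str]) -> List[Step]:
    """Solver: produce reasoning steps (hard-coded here for illustration)."""
    return [
        Step("leaf color", "The leaf is green."),
        Step("chlorophyll", "It looks green because chlorophyll reflects green light."),
    ]


def inject_citations(steps: List[Step], evidence: Dict[str, str]) -> List[Step]:
    """Citation Injector: attach an evidence key to each step that has one."""
    for step in steps:
        if step.topic in evidence:
            step.citations.append(step.topic)
    return steps


def verify(steps: List[Step]) -> List[str]:
    """Verifier: return the topics of steps that cite no evidence."""
    return [step.topic for step in steps if not step.citations]


def run_pipeline(question: str, max_rounds: int = 3) -> Tuple[List[Step], int, bool]:
    """Run the five stages, looping back to the Extractor on ungrounded claims."""
    feedback: Optional[List[str]] = None
    steps: List[Step] = []
    for round_no in range(1, max_rounds + 1):
        queries = extract(question, feedback)
        evidence = retrieve(queries)
        steps = inject_citations(solve(question, evidence), evidence)
        feedback = verify(steps)
        if not feedback:  # every step is grounded, so stop looping
            return steps, round_no, True
    return steps, max_rounds, False


if __name__ == "__main__":
    final_steps, rounds, grounded = run_pipeline("Why is the leaf green?")
    print(rounds, grounded, [s.citations for s in final_steps])
```

In this toy run the first pass leaves the chlorophyll step uncited, the Verifier reports it, and the second pass retrieves the missing evidence, so the loop ends after two rounds with every step grounded.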
The benchmark numbers give a clearer picture of where the system stands: on ScienceQA, the framework hits 87.1% accuracy and a CaVeScore of 56.6%; on the harder MMMU benchmark across 30 subjects, accuracy drops to 55.2% with a CaVeScore of 35.7%. The gap between the two CaVeScores matters: MMMU's breadth exposes how much grounding quality degrades when the subject domain widens, which is exactly the condition real deployments face.
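To put a size on that gap, the short Python snippet below does the arithmetic on the four figures quoted above. It uses only those reported numbers; the dictionary layout and the `drop` helper are illustrative names, not part of the framework.

```python
# Reported results, in percent, as quoted in the text above.
RESULTS = {
    "ScienceQA": {"accuracy": 87.1, "cavescore": 56.6},
    "MMMU": {"accuracy": 55.2, "cavescore": 35.7},
}


def drop(metric: str) -> tuple:
    """Return (absolute drop in points, relative drop in percent) from ScienceQA to MMMU."""
    before = RESULTS["ScienceQA"][metric]
    after = RESULTS["MMMU"][metric]
    absolute = before - after
    relative = 100 * absolute / before
    return round(absolute, 1), round(relative, 1)


if __name__ == "__main__":
    for metric in ("accuracy", "cavescore"):
        points, percent = drop(metric)
        print(f"{metric}: -{points} points ({percent}% relative)")
```

With the reported figures, accuracy falls by 31.9 points (36.6% relative) and CaVeScore by 20.9 points (36.9% relative).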
The framework requires no architectural changes or prompt modifications, which lowers the barrier to adoption — but a CaVeScore of 35.7% on MMMU is a reminder that "interpretable" and "reliable" are not yet the same thing.