Vision‑language models can confidently answer questions even when the image provides no relevant evidence, a flaw dubbed “mirage.”
Researchers introduced Text‑Conditioned Layer‑wise Internal Alignment (TC‑LIA), a model‑agnostic detector that tracks how strongly image patch tokens align with the question across the layers of CLIP ViT‑H/14. The method projects intermediate patch tokens into the final CLIP embedding space and measures their cosine similarity to the question, which yields a layer‑by‑layer trajectory of visual relevance. TC‑LIA is combined with pixel‑level blank detection, zero‑shot domain routing and VLM self‑assessment, and the resulting ensemble was tested on five VQA domains and twelve backbones.
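The trajectory idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the plain list-of-floats representation, the single shared projection matrix and the choice to average over patches are all assumptions made for illustration. It shows only the two steps named above, projecting each layer's patch tokens into a shared space and scoring them against a question embedding by cosine similarity.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def project(token: Vector, proj: Sequence[Vector]) -> List[float]:
    """Apply a linear projection; each row of `proj` yields one output dimension."""
    return [sum(w * x for w, x in zip(row, token)) for row in proj]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_trajectory(
    layer_tokens: Sequence[Sequence[Vector]],
    proj: Sequence[Vector],
    text_emb: Vector,
) -> List[float]:
    """Return one alignment score per layer.

    layer_tokens[l][p] is patch token p at layer l (hypothetical layout).
    Averaging over patches is an assumption; the source does not say how
    per-patch similarities are aggregated.
    """
    trajectory = []
    for tokens in layer_tokens:
        sims = [cosine(project(t, proj), text_emb) for t in tokens]
        trajectory.append(sum(sims) / len(sims))
    return trajectory


# Toy data: two layers, two patches each, 2-D tokens, identity projection.
toy_layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
toy_proj = [[1.0, 0.0], [0.0, 1.0]]
toy_question = [1.0, 1.0]
print(alignment_trajectory(toy_layers, toy_proj, toy_question))
```

In a real pipeline the tokens and the question embedding would come from the CLIP image and text encoders, and a detector would then examine the shape of the trajectory; how TC‑LIA turns that shape into a decision is not described in this summary.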
With the 32‑billion‑parameter Qwen2.5‑VL model, the approach reaches 94.7% detection accuracy and cuts the mirage rate to 3.0%, compared with baseline error rates between 21.7% and 66.6%. In safety‑critical fields such as medical imaging, catching a hallucination before the answer is produced could prevent false confidence in AI‑generated reports.
The work is presented as the first systematic pre‑answer filter for VLMs. It echoes earlier work on text‑only hallucination detection and extends that idea to multimodal reasoning.
Bottom line: TC‑LIA shows that VLMs can be made to self‑pause when visual evidence is missing, offering a practical safeguard as these systems move into high‑stakes applications.