
Training-free fix reduces attention distraction in RAG LVLMs

MAD‑RAG separates visual grounding from retrieved text, restoring accuracy on knowledge‑heavy VQA tasks without extra training.

RAG‑augmented vision‑language models can misfire when retrieved passages drown out visual cues.

Researchers found that even highly relevant text can pull attention away from image regions needed to answer a question, a failure mode they call attention distraction (AD). In AD scenarios the model drops the correct visual focus and produces the wrong answer, despite having the needed information in the retrieved passage.

To address this, they introduced MAD‑RAG, a training‑free tweak that frames the prompt as two linked questions and mixes attention maps so visual evidence stays visible. Tests on OK‑VQA, E‑VQA and InfoSeek show consistent gains of up to 4.8% absolute over vanilla RAG, and the method fixes roughly three‑quarters of AD failures, all without extra compute.
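To make the idea of mixing attention maps concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes that one pass yields attention weights over image tokens from a question asked about the image alone, that a second pass yields weights from the same question with retrieved text attached, and that the two are blended by a simple convex combination. The names `mix_attention`, `visual_attn`, `rag_attn`, and `alpha` are hypothetical and chosen only for illustration.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    return [w / total for w in weights]


def mix_attention(visual_attn, rag_attn, alpha=0.5):
    """Blend two attention distributions over the same image tokens.

    Assumption (not from the source): the blend is a convex combination,
    alpha * visual + (1 - alpha) * rag, renormalized to sum to 1.
    alpha = 1.0 keeps only the image-only attention; alpha = 0.0 keeps
    only the attention from the retrieval-augmented pass.
    """
    if len(visual_attn) != len(rag_attn):
        raise ValueError("both maps must cover the same image tokens")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    v = normalize(visual_attn)
    r = normalize(rag_attn)
    return normalize([alpha * a + (1.0 - alpha) * b for a, b in zip(v, r)])


# Toy example: the image-only pass focuses on token 0, while the pass
# with retrieved text has drifted toward token 2.
visual_attn = [0.7, 0.2, 0.1]
rag_attn = [0.1, 0.2, 0.7]
mixed = mix_attention(visual_attn, rag_attn, alpha=0.5)
print([round(w, 3) for w in mixed])
```

In this toy case the blended map gives token 0 more weight than the retrieval-augmented pass alone did, which is the intuition behind keeping visual evidence visible. How the real method chooses which layers or heads to mix, and with what weights, is not specified here.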

The fix is a reminder that more context isn’t always better; balancing text and image signals remains a core challenge as LVLMs chase broader knowledge domains.


The Revision

Written by an AI system from the public sources credited above.