Large vision-language models (LVLMs) hallucinate fewer objects after a simple decode-time attention adjustment, according to an arXiv paper.
The arXiv paper identifies uneven attention allocation, both between the image and text modalities and among individual tokens, as a key driver of false object mentions. The authors quantify this with a metric they call attention imbalance and present visual maps showing where models over-focus on irrelevant words or under-focus on salient visual features. Their proposed fix, Attention Imbalance Rectification (AIR), reshapes attention weights at decode time without retraining. In tests on four leading LVLMs across the CHAIR, POPE and MM-Vet benchmarks, the authors report hallucination rates falling by up to 35.1% and modest gains of up to 15.9% on broader vision-language tasks.
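To make the idea concrete, here is a minimal sketch of what measuring and correcting a modality-level imbalance could look like for a single row of attention weights. It is an illustration under stated assumptions, not the paper's definition of attention imbalance or AIR's actual rule, neither of which is given here. The function names `modality_shares` and `rebalance`, the `target_image_share` parameter, and the toy weights are all hypothetical.

```python
from typing import List, Sequence, Tuple


def modality_shares(attn_row: Sequence[float],
                    is_image: Sequence[bool]) -> Tuple[float, float]:
    """Return (image_share, text_share) for one row of attention weights."""
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention row must have positive total mass")
    image_mass = sum(w for w, img in zip(attn_row, is_image) if img)
    image_share = image_mass / total
    return image_share, 1.0 - image_share


def rebalance(attn_row: Sequence[float],
              is_image: Sequence[bool],
              target_image_share: float) -> List[float]:
    """Rescale a row so image tokens hold `target_image_share` of the mass.

    Hypothetical rule for illustration only: weights are scaled per
    modality, so relative weights inside each modality are preserved
    and the returned row sums to 1.
    """
    if not 0.0 < target_image_share < 1.0:
        raise ValueError("target_image_share must be strictly between 0 and 1")
    image_share, text_share = modality_shares(attn_row, is_image)
    if image_share == 0.0 or text_share == 0.0:
        raise ValueError("row must carry attention mass on both modalities")
    total = sum(attn_row)
    image_scale = target_image_share / image_share
    text_scale = (1.0 - target_image_share) / text_share
    return [(w / total) * (image_scale if img else text_scale)
            for w, img in zip(attn_row, is_image)]


# Toy row (made-up numbers): three image tokens, then two text tokens.
row = [0.05, 0.05, 0.10, 0.50, 0.30]
mask = [True, True, True, False, False]

print(modality_shares(row, mask))
print(rebalance(row, mask, target_image_share=0.5))
```

In this toy row the image tokens hold roughly 20% of the attention mass. After rebalancing to a 0.5 target they hold half, and the ordering of tokens within each modality is unchanged. A real decode-time method would apply some such adjustment inside the model's attention layers at each generation step, which is why no retraining is involved.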
This matters because hallucinations undermine trust in safety‑critical settings like driver assistance or medical imaging, where false detections can have real costs. AIR offers a lightweight alternative to heavy fine‑tuning or data‑augmentation pipelines, making it easier to retrofit existing models.
Still, even a best-case cut of about 35% leaves plenty of room for error, and a method that intervenes at decoding time may not carry over to every deployment scenario, for example where a model's attention weights are not accessible.