
Reinforcement learning lifts relation‑reasoning accuracy of LVLM captions by about 20%

A new RL framework called Cross‑modal Identity Mapping reduces information loss in vision‑language captioning, raising relation‑reasoning accuracy on COCO‑LN500 by about 20% for Qwen2.5‑VL‑7B.

Large vision‑language models still drop details when they turn pictures into text. Researchers propose a way to measure that loss and teach the models to keep more of the image.

The team introduced Cross‑modal Identity Mapping (CIM), a reinforcement‑learning loop that scores captions by how well a text‑based image search can retrieve the original picture. Two metrics drive the reward: consistency of the gallery representation and relevance between the query caption and the retrieved images. No extra labeling is needed; the model learns from its own outputs. On the COCO‑LN500 benchmark, CIM raised relation‑reasoning accuracy by about 20% on Qwen2.5‑VL‑7B, beating standard supervised fine‑tuning.
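
To make the round-trip idea concrete, here is a minimal sketch of how a retrieval-based reward of this kind could be computed. It assumes the caption and a pool of candidate images have already been embedded in a shared space (for example by a CLIP-style encoder). The function name `retrieval_reward`, the top-k gallery, the cosine-similarity definitions of the two metrics, and the weighting parameter `alpha` are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def _normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieval_reward(caption_emb, image_embs, k=3, alpha=0.5):
    """Hypothetical reward for one caption (not the paper's exact formula).

    caption_emb: shape (d,), embedding of the generated caption.
    image_embs:  shape (n, d), embeddings of the searchable image pool.
    k:           size of the retrieved gallery (assumed).
    alpha:       weight on gallery consistency vs. relevance (assumed).
    """
    query = _normalize(np.asarray(caption_emb, dtype=float))
    images = _normalize(np.asarray(image_embs, dtype=float))

    # Text-based image search: rank images by cosine similarity to the caption.
    sims = images @ query
    top = np.argsort(-sims)[:k]
    gallery = images[top]

    # Relevance: how close the retrieved images are to the query caption.
    relevance = float(sims[top].mean())

    # Consistency: mean pairwise similarity among the retrieved images
    # (a detailed caption should pull back a tight, homogeneous gallery).
    n = len(top)
    if n > 1:
        pairwise = gallery @ gallery.T
        consistency = float((pairwise.sum() - np.trace(pairwise)) / (n * (n - 1)))
    else:
        consistency = 1.0

    return alpha * consistency + (1.0 - alpha) * relevance
```

In an RL loop, a score like this would stand in for a human-labeled reward: captions that retrieve a tight, relevant gallery score higher than vague captions that match many unrelated images.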

If captioning quality can be judged by a round‑trip search, the field gains a cheap, scalable signal for training. That could narrow the gap between vision and language modules without the costly annotation pipelines that have limited progress so far.

The results are promising, but they hinge on one benchmark and a single model family. Wider testing will reveal whether the approach survives in more diverse settings or simply exploits quirks of the COCO‑LN500 test set.


The Revision

Written by an AI system from the public sources credited above.