
Game-theoretic decoding cuts errors in tiny medical VQA models

A new Wasserstein stopping rule boosts accuracy and speeds up inference for 2‑8 B vision‑language models on clinical question answering benchmarks.

A novel decoding technique improves small vision‑language models on medical visual question answering.

Researchers extended game‑theoretic decoding, once limited to text‑only tasks, to vision‑language models. They replaced the lexical matching check with a Wasserstein‑based stopping criterion that looks for semantic consensus among near‑synonymous answers. The change lets the model stop early when candidates are clinically equivalent, cutting unnecessary iterations.
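The article does not spell out the criterion itself, so the following is only a minimal sketch of one way such a rule could work, not the researchers' implementation. It assumes the decoder maintains a probability distribution over candidate answers and that a pairwise semantic distance matrix is available (for example, from answer embeddings). The names `wasserstein_to_point_mass`, `should_stop`, `dist`, and the threshold `eps` are all hypothetical. The sketch relies on one exact fact: the Wasserstein-1 distance from a distribution to a point mass on candidate j is the probability-weighted average distance to j, because every unit of mass has to be moved to j.

```python
from typing import List, Sequence, Tuple


def wasserstein_to_point_mass(
    probs: Sequence[float], dist: Sequence[Sequence[float]], j: int
) -> float:
    """Wasserstein-1 distance between `probs` and a point mass on candidate j.

    With all target mass on j, the only feasible transport plan moves every
    unit of mass from candidate i to j, so the cost is sum_i probs[i] * dist[i][j].
    """
    return sum(p * dist[i][j] for i, p in enumerate(probs))


def should_stop(
    probs: Sequence[float], dist: Sequence[Sequence[float]], eps: float = 0.1
) -> Tuple[bool, int]:
    """Hypothetical stopping rule: stop once the distribution is within `eps`
    (in Wasserstein-1 distance) of a single answer. Returns (stop, best_index).
    """
    costs: List[float] = [
        wasserstein_to_point_mass(probs, dist, j) for j in range(len(probs))
    ]
    best = min(range(len(costs)), key=costs.__getitem__)
    return costs[best] <= eps, best


# Toy data (made up for illustration): two near-synonyms and one distinct answer.
candidates = ["pneumonia", "lung infection", "fracture"]
dist = [
    [0.00, 0.05, 1.00],
    [0.05, 0.00, 1.00],
    [1.00, 1.00, 0.00],
]

# Mass split between near-synonyms: a lexical check sees disagreement,
# but the semantic distance to "pneumonia" is small, so the rule stops.
split_between_synonyms = [0.55, 0.43, 0.02]

# Mass split between clinically different answers: the rule keeps iterating.
genuinely_undecided = [0.50, 0.05, 0.45]

print(should_stop(split_between_synonyms, dist))
print(should_stop(genuinely_undecided, dist))
```

In this toy setup, an exact-match check would keep iterating on the first distribution because no single string dominates, whereas the distance-based check treats the two synonymous answers as agreement. That is the kind of early exit the article describes, under the assumptions stated above.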

On the VQA‑RAD dataset, the 2 B Qwen3‑VL model gained 3.5 percentage points in accuracy over greedy decoding, even outpacing a 4 B greedy baseline. Similar gains appeared on PathVQA, where a 3‑4 B Gemma model matched a 4 B MedGemma model without any domain‑specific fine‑tuning. The new criterion also cut the average number of decoding steps by roughly 20%, which helps meet the low‑latency demands of on‑device or on‑premise deployment.

The work shows that modest‑size models can close the performance gap with larger, fine‑tuned peers by improving inference logic rather than scaling parameters. If hospitals adopt such techniques, they may be able to keep patient data local while still getting reliable answers, a combination that has been hard to achieve with earlier decoding methods.


The Revision

Written by an AI system from the public sources credited above.