An academic team has released Vero, a fully open family of vision-language models that match or beat closed-pipeline competitors on a wide range of visual reasoning benchmarks.
The researchers built Vero-600K, a 600,000-sample training dataset drawn from 59 existing datasets, spanning six task categories including charts, science questions, and spatial reasoning. They paired that data with task-routed rewards — a system that scores model answers differently depending on which category is being tested. Trained across five base models, Vero variants improved 2.9 to 5.4 points on average over their starting checkpoints. The flagship, Vero-Qwen3I-8B, beat Qwen3-VL-8B-Thinking by 3.8 points on average without leaning on knowledge distillation from a larger model.
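To make the idea of task-routed rewards concrete, here is a minimal sketch in Python. It is not the Vero team's implementation. The category keys, the two scoring functions, and the 5% tolerance are all hypothetical, chosen only to show how a reward can be dispatched by task category.

```python
from typing import Callable, Dict


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the answers match after trimming and lowercasing."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def numeric_tolerance(prediction: str, reference: str, rel_tol: float = 0.05) -> float:
    """Return 1.0 when a numeric answer is within rel_tol of the reference."""
    try:
        pred, ref = float(prediction), float(reference)
    except ValueError:
        return 0.0  # non-numeric answers earn no reward under this scorer
    if ref == 0:
        return 1.0 if pred == 0 else 0.0
    return 1.0 if abs(pred - ref) / abs(ref) <= rel_tol else 0.0


# Hypothetical routing table: each task category gets its own scorer.
REWARD_ROUTES: Dict[str, Callable[[str, str], float]] = {
    "charts": numeric_tolerance,  # assumption: chart answers are numeric readings
    "science": exact_match,       # assumption: short or multiple-choice answers
    "spatial": exact_match,       # assumption: short categorical answers
}


def routed_reward(category: str, prediction: str, reference: str) -> float:
    """Score a model answer with the scorer registered for its category."""
    scorer = REWARD_ROUTES.get(category, exact_match)  # fall back to exact match
    return scorer(prediction, reference)
```

Under these assumptions, the same answer can earn a different reward depending on the route: `routed_reward("charts", "10.2", "10")` falls within the tolerance and scores 1.0, while `routed_reward("science", "10.2", "10")` is scored by exact match and gets 0.0.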
The significance here is less about benchmark numbers and more about access. The major labs behind the leading vision-language models, the ones that consistently top leaderboards, keep their training data and reinforcement learning pipelines private, which means outside researchers cannot audit, reproduce, or build on the gains. Vero puts the dataset, code, and models into the open, giving the research community something to actually stress-test.
The paper's own ablations are the most interesting part: different task categories trigger distinct reasoning patterns, and the model only captures broad gains when trained on all six jointly. That finding complicates the common assumption that visual reasoning is a single transferable skill — and suggests closed labs optimizing narrow benchmarks may be leaving general capability on the table.