Nvidia Cosmos 3 Unifies Vision, Audio, and Action in One Model

Nvidia has released Cosmos 3, an open-source model family that processes and generates five modalities at once: language, images, video, audio, and physical action sequences.

The system uses a mixture-of-transformers architecture that Nvidia says replaces the need for separate vision-language models, video generators, and robot policy models. At publication time, the third-party evaluator Artificial Analysis ranked its post-trained checkpoints first among open-source text-to-image and image-to-video models, and RoboArena ranked the policy model first in its class. Code, weights, synthetic datasets, and benchmarks are available on GitHub and Hugging Face under the Linux Foundation's OpenMDW-1.1 License.

The practical pitch is consolidation: one model backbone instead of a patchwork of specialized systems, which matters for robotics and embodied AI where coordinating separate models adds latency and integration debt. If the benchmark rankings hold up under independent scrutiny, Cosmos 3 could shift the baseline expectation for what an open physical AI stack looks like.

Nvidia already sells the hardware these models run on, so releasing capable open weights is also a straightforward way to drive demand for its chips — generosity and self-interest rarely conflict here.
