A research team has found a cheaper path to teaching AI systems to understand 3D spaces, and it skips the specialized geometry encoders most rivals require.
Current Vision-Language Models (VLMs) that handle 3D scene understanding typically bolt on custom geometry encoders or demand enormous training runs to develop spatial reasoning. OneCanvas takes a different route: it maps image patches from every camera view onto a single equirectangular panoramic canvas, pinning each patch to the longitude and latitude that correspond to its real-world position. Depth information, which would otherwise be lost in that 2D projection, is reintroduced through a separate position embedding. The result is a unified spatial representation that a standard pretrained VLM can read like any ordinary image, with no major changes to the underlying model architecture.
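To make the projection concrete, here is a minimal sketch of how a single patch center could be placed on an equirectangular canvas. It is an illustration under stated assumptions, not the OneCanvas implementation: the function names, the axis convention (x and y in the ground plane, z up, longitude measured from the +x axis), the canvas size, and the sinusoidal form of the depth embedding are all hypothetical. The article says only that depth comes back through "a separate position embedding" and does not specify its form.

```python
import math

def project_to_canvas(point, center=(0.0, 0.0, 0.0), width=64, height=32):
    """Map a 3D point to (row, col, depth) on an equirectangular canvas.

    Assumed convention (not from the paper): x and y span the ground
    plane, z points up, and longitude is measured from the +x axis.
    """
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    dz = point[2] - center[2]
    depth = math.sqrt(dx * dx + dy * dy + dz * dz)
    if depth == 0.0:
        raise ValueError("point coincides with the canvas center")
    lon = math.atan2(dy, dx)       # longitude in [-pi, pi]
    lat = math.asin(dz / depth)    # latitude in [-pi/2, pi/2]
    # Longitude spans the canvas width; latitude spans its height (row 0 = straight up).
    col = int((lon + math.pi) / (2 * math.pi) * width) % width
    row = min(int((math.pi / 2 - lat) / math.pi * height), height - 1)
    return row, col, depth

def depth_embedding(depth, dim=8, max_depth=10.0):
    """Sinusoidal stand-in for the depth position embedding (an assumption)."""
    half = dim // 2
    emb = []
    for i in range(half):
        freq = 1.0 / (max_depth ** (i / half))
        emb.append(math.sin(depth * freq))
        emb.append(math.cos(depth * freq))
    return emb
```

With these defaults, `project_to_canvas((2.0, 0.0, 0.0))` returns `(16, 32, 2.0)`: a point straight along the +x axis lands in the middle of the canvas, and its distance is kept as a separate value because the 2D cell alone cannot encode it.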
The payoff: OneCanvas reaches state-of-the-art scores on the SQA3D and VSI-Bench benchmarks while using roughly one-tenth the training compute of its closest competitors. That gap matters because compute cost is one of the main reasons capable 3D-aware models stay locked inside large labs. A method that reaches that level of accuracy on a fraction of the budget is a plausible on-ramp for robotics and embodied AI teams that cannot afford to train from scratch.
The canvas can be recentered on any viewpoint of interest, which makes situated reasoning (knowing where you are in a scene, not just what objects are present) a native feature rather than an afterthought. That is a quiet but meaningful advantage in robotics, where most competing approaches treat egocentric reasoning as a secondary use case rather than a design constraint. Whether the efficiency holds in messier real-world deployments, beyond benchmark datasets, is the question the paper does not yet answer.
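Recentering can be pictured as a change of reference frame applied before the panoramic projection. The sketch below is one plausible reading, not the paper's procedure: it assumes a viewpoint described by a position and a yaw angle about the vertical axis, and both helper names (`recenter`, `longitude_latitude`) are hypothetical.

```python
import math

def recenter(point, position, yaw):
    """Express a world-frame point in a frame centered on the viewer.

    Assumption: the viewer's pose is a position plus a yaw (rotation about z).
    After recentering, the viewer's heading lies along the +x axis.
    """
    dx, dy, dz = (p - q for p, q in zip(point, position))
    c, s = math.cos(-yaw), math.sin(-yaw)   # rotate by -yaw to undo the heading
    return (c * dx - s * dy, s * dx + c * dy, dz)

def longitude_latitude(point):
    """Spherical angles of a point as seen from the origin of its frame."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    return math.atan2(y, x), math.asin(z / r)

# An object 2 m along the world +y axis, seen by a viewer at the origin facing +y.
obj = (0.0, 2.0, 0.0)
lon, lat = longitude_latitude(recenter(obj, (0.0, 0.0, 0.0), math.pi / 2))
```

In the world frame the object sits at a longitude of 90 degrees; after recentering on the viewer, both `lon` and `lat` come out at essentially zero, so the object falls at the middle of the canvas, directly ahead of the viewer.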