A research framework called Guava can teach a 4-billion-parameter open-source model to manipulate physical objects nearly as well as much larger proprietary systems.
Guava is a "harness": a structured scaffold that wraps an existing language model with tools for perception, planning, and control, rather than training a single end-to-end model to do everything. The researchers tested different combinations of agent workflows, action spaces, and observation spaces to find which design choices matter. They landed on three ingredients: iterative loops in which the model perceives, reasons, and acts in sequence; high-level action abstractions that hide low-level motor details; and multimodal inputs that combine vision with language. Using fewer than 2,000 simulated training trajectories, they distilled those capabilities into a 4B open-source model. In both simulation and real-world tests, the researchers report performance comparable to that of frontier proprietary models on unseen objects, novel instructions, and long-horizon tasks.
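To make the first two ingredients concrete, here is a minimal Python sketch of a perceive-reason-act loop over high-level actions. It is not Guava's code or API: `ToyEnv`, `Observation`, `Action`, `scripted_policy`, and `run_episode` are hypothetical names invented for illustration, the "vision" input is reduced to a text caption, and a hand-written rule stands in for the language model.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    scene: str        # hypothetical stand-in for the visual input
    instruction: str  # natural-language task description


@dataclass
class Action:
    name: str         # high-level skill such as "pick" or "place"
    target: str = ""  # object or location the skill applies to


class ToyEnv:
    """Toy world (an assumption for this sketch): move one block into a bin."""

    def __init__(self):
        self.holding = False
        self.done = False

    def observe(self, instruction):
        if self.done:
            scene = "block is in the bin"
        elif self.holding:
            scene = "gripper holds the block"
        else:
            scene = "block is on the table"
        return Observation(scene, instruction)

    def step(self, action):
        # A real system would translate each high-level action into motor
        # commands here; that detail stays hidden from the policy.
        if action.name == "pick" and not self.holding:
            self.holding = True
        elif action.name == "place" and self.holding:
            self.holding = False
            self.done = True


def scripted_policy(obs):
    """Placeholder for the model's reasoning step (not a real model call)."""
    if "in the bin" in obs.scene:
        return Action("stop")
    if "holds" in obs.scene:
        return Action("place", "bin")
    return Action("pick", "block")


def run_episode(env, policy, instruction, max_steps=10):
    """Iterate perceive -> reason -> act until the policy stops."""
    trace = []
    for _ in range(max_steps):
        obs = env.observe(instruction)  # perceive
        action = policy(obs)            # reason
        if action.name == "stop":
            break
        env.step(action)                # act
        trace.append(action.name)
    return trace


env = ToyEnv()
trace = run_episode(env, scripted_policy, "put the block in the bin")
print(trace)  # ['pick', 'place']
```

The point of the sketch is the interface: the policy only ever sees an observation and only ever emits a named skill, so swapping the scripted rule for a compact language model would not change the loop.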
The significance isn't the robot arm; it's the argument that the scaffold matters as much as the model. Much robotics AI research chases bigger models or more data. Guava suggests instead that careful interface design can extract strong performance from compact, cheap-to-run models. That has real implications for labs that can't afford to train or serve frontier-scale systems.
The caveat: "comparable to frontier proprietary models" is a claim that deserves scrutiny until independent benchmarks confirm it, and simulation-to-real transfer has a long history of looking better on paper than on a factory floor.