Ai2's MolmoAct 7B Challenges Nvidia & Google in 3D Physical Reasoning
What is the Viqus Verdict?
We evaluate each news story based on its real impact versus its media hype to offer a clear and objective perspective.
AI Analysis:
While the immediate impact may seem confined to research benchmarks, the model’s open-source release is likely to draw broader interest and development, driving significant long-term impact.
Article Summary
Ai2’s newly released MolmoAct 7B represents a significant step forward in building robots that can understand and interact with the physical world. Combining large language models (LLMs) with foundation models, MolmoAct lets robots ‘think’ in three dimensions, interpreting spatial relationships and planning actions accordingly. The model’s key innovation lies in its output of "spatially grounded perception tokens," a novel approach distinct from traditional vision-language-action (VLA) models that enables a deeper understanding of the surrounding environment. In benchmarking tests, MolmoAct exceeded the success rates of models from Google, Microsoft, and Nvidia. Crucially, the model’s open-source release, together with its readily accessible training data, is expected to accelerate research and development in the burgeoning physical AI space. The news arrives as interest in more spatially aware robots, a long-held goal in robotics, grows alongside advances in LLMs.
Key Points
- MolmoAct 7B, developed by Ai2, is an open-source model that allows robots to ‘reason in space’ through 3D spatial understanding.
- The model’s innovation is its use of ‘spatially grounded perception tokens,’ distinct from traditional VLA models, offering a deeper understanding of the physical world.
- MolmoAct outperformed models from Google, Microsoft, and Nvidia in initial benchmarking tests, highlighting the potential of this approach.
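To make the idea of ‘spatially grounded perception tokens’ concrete, the toy sketch below discretizes a continuous 3D point into a small vocabulary of bin tokens that a language-model-style decoder could emit alongside text, and maps them back. This is not Ai2’s actual MolmoAct tokenization scheme; the bin count, workspace bounds, and token format are all illustrative assumptions.

```python
# Toy illustration of the general idea behind spatially grounded tokens:
# continuous 3D coordinates become discrete per-axis tokens such as '<x_159>'.
# NOT MolmoAct's real scheme -- bin count and bounds are made up for the sketch.

N_BINS = 256          # bins per axis (illustrative)
LO, HI = -1.0, 1.0    # workspace bounds, e.g. meters (illustrative)

def point_to_tokens(x, y, z):
    """Map a 3D point to three discrete bin tokens."""
    def bin_of(v):
        v = min(max(v, LO), HI)                        # clamp to workspace
        return round((v - LO) / (HI - LO) * (N_BINS - 1))
    return [f"<x_{bin_of(x)}>", f"<y_{bin_of(y)}>", f"<z_{bin_of(z)}>"]

def tokens_to_point(tokens):
    """Invert the mapping back to (approximate) coordinates."""
    def val_of(tok):
        idx = int(tok.split("_")[1].rstrip(">"))       # '<x_159>' -> 159
        return LO + idx / (N_BINS - 1) * (HI - LO)
    return tuple(val_of(t) for t in tokens)

toks = point_to_tokens(0.25, -0.5, 0.0)
print(toks)                  # three discrete tokens
print(tokens_to_point(toks)) # recovered point, within one bin width per axis
```

Grounding outputs in discrete spatial tokens like this is one way a language model can reason about and plan over positions in a scene, rather than emitting only low-level action commands as traditional VLA models do.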

