AI2's MolmoAct 7B Challenges Nvidia and Google in 3D Robot Reasoning
Viqus Verdict: 9
What is the Viqus Verdict?
We evaluate each news story based on its real impact versus its media hype to offer a clear and objective perspective.
AI Analysis:
While the immediate impact of this research is relatively contained, the underlying technology represents a fundamental shift in robotics with significant long-term potential. The combination of LLMs and robotics is a game-changer, and MolmoAct’s open-source release will likely accelerate development and adoption.
Article Summary
The Allen Institute for AI (Ai2) has unveiled MolmoAct 7B, a model poised to shift the landscape of physical AI. Built on Ai2’s open-source Molmo, MolmoAct lets robots ‘reason in space’ by pairing large language models (LLMs) with robotics, giving them the ability to understand and act within the physical world. Unlike traditional vision-language-action (VLA) models, it ‘thinks’ in three dimensions, converting inputs such as video into spatially grounded tokens. From these tokens, the model estimates distances between objects and then predicts a sequence of ‘image-space’ waypoints, which translate into actions like adjusting an arm or stretching out. In Ai2’s benchmarks, MolmoAct achieved a task success rate of 72.1%, surpassing models from industry leaders including Google, Microsoft, and Nvidia. This represents a significant step towards true physical intelligence, a long-held goal for robotics developers. The model’s open-source nature and ease of adaptation are particularly noteworthy: it functions across different robot embodiments with minimal fine-tuning. The release adds to existing interest in physical AI, driven by advancements from Google Research (SayCan) and Meta/NYU (OK-Robot), and the recent launch of Hugging Face’s desktop robot. The open-source element of the project has been widely praised by the robotics community.

Key Points
- MolmoAct 7B is an open-source model developed by Ai2 that allows robots to ‘reason in space’ by understanding and interacting with the physical world.
- The model uses spatially grounded tokens to represent data inputs, enabling robots to gain a 3D understanding of their surroundings and plan actions accordingly.
- MolmoAct outperformed models from Nvidia, Google, and Microsoft in a benchmark task, highlighting its potential to advance the field of physical AI.
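To make the perceive-plan-act pipeline described above concrete, here is a minimal sketch of the idea in Python. Everything in it is hypothetical: the real MolmoAct emits learned spatial tokens and waypoints from a 7B-parameter model, not the hand-written geometry below. The sketch only illustrates the two stages the article names: estimating distances between spatially grounded objects, then planning a sequence of image-space waypoints toward a target.

```python
# Illustrative sketch (hypothetical names; not the MolmoAct API).
from dataclasses import dataclass
import math


@dataclass
class SpatialToken:
    """Toy stand-in for a spatially grounded token: an object label
    plus its estimated (x, y) position in image space."""
    label: str
    x: float
    y: float


def estimate_distance(a: SpatialToken, b: SpatialToken) -> float:
    """Stage 1: estimate the distance between two perceived objects."""
    return math.hypot(a.x - b.x, a.y - b.y)


def plan_waypoints(start: SpatialToken, goal: SpatialToken,
                   steps: int = 4) -> list[tuple[float, float]]:
    """Stage 2: predict a sequence of image-space waypoints that move
    from the current position toward the target (linear interpolation
    here; the real model predicts these waypoints directly)."""
    return [
        (start.x + (goal.x - start.x) * t / steps,
         start.y + (goal.y - start.y) * t / steps)
        for t in range(1, steps + 1)
    ]


gripper = SpatialToken("gripper", 10.0, 10.0)
mug = SpatialToken("mug", 50.0, 40.0)

print(f"distance: {estimate_distance(gripper, mug):.1f}")  # -> distance: 50.0
for wx, wy in plan_waypoints(gripper, mug):
    print(f"move arm toward ({wx:.1f}, {wy:.1f})")
```

The final waypoint coincides with the target object, and a downstream controller would translate each waypoint into low-level motor commands such as adjusting the arm.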

