
AllenAI Unveils MolmoMotion: Language-Guided 3D Motion Forecasting for Robotics and Video Synthesis

3D motion forecasting, object-grounded 3D point trajectories, MolmoMotion, trajectory-conditioned video generation, robotics planning, PointMotionBench
June 17, 2026
Viqus Verdict: 8
High-Fidelity Prediction Engine for Embodied AI
Media Hype 6/10
Real Impact 8/10

Article Summary

AllenAI has released MolmoMotion, a motion forecasting model that predicts how specific 3D points on an object will move over time, given an initial video observation and a natural-language description of the action. Unlike motion tracking, which recovers movement that has already happened, MolmoMotion anticipates future physical movement, a capability relevant both to robotics planning and to controllable, physically plausible video generation. The model is built on Molmo 2 and comes in two variants: an autoregressive one (MolmoMotion-AR) that predicts trajectories step by step, and a flow-matching one (MolmoMotion-FM) that represents continuous uncertainty over future motion. The release also includes MolmoMotion-1M, a large, newly compiled dataset of object-grounded 3D point trajectories paired with action descriptions, and PointMotionBench, a human-validated benchmark for quantitatively evaluating 3D motion accuracy.
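
To make the difference between the two variants concrete, here is a toy sketch. It is not the released model: it tracks a single 3D point, and the names `toy_step`, `autoregressive_rollout`, and `flow_matching_sample` are hypothetical stand-ins for learned networks. The autoregressive rollout feeds each predicted position back in as the next input, while the flow-matching-style sampler starts from a noise sample and integrates a velocity field toward an output with Euler steps. In this toy the velocity field is given the target directly, which a real model would not have; it would predict the velocity from the video and the text instead.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]


def toy_step(p: Point) -> Point:
    """Hypothetical stand-in for a learned next-step predictor: drift +0.1 along x."""
    return (p[0] + 0.1, p[1], p[2])


def autoregressive_rollout(
    start: Point, steps: int, step_fn: Callable[[Point], Point] = toy_step
) -> List[Point]:
    """Predict one step at a time, feeding each output back in as the next input."""
    trajectory = [start]
    for _ in range(steps):
        trajectory.append(step_fn(trajectory[-1]))
    return trajectory


def flow_matching_sample(
    noise: List[float], target: List[float], n_steps: int = 10
) -> List[float]:
    """Euler-integrate dx/dt = (target - x) / (1 - t) from t = 0 to t = 1.

    This is the velocity of a straight-line path from `noise` to `target`.
    Knowing `target` in advance is purely illustrative (an assumption of this toy).
    """
    x = list(noise)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k / n_steps  # t stays strictly below 1, so the division is safe
        x = [xi + dt * (ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return x
```

In a learned flow-matching model, different noise samples can end at different plausible futures, which is how such a model can express uncertainty; the toy collapses every sample onto one target for simplicity.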

Key Points

  • MolmoMotion predicts future 3D trajectories from a language prompt and an initial observation of the object, rather than tracking motion that has already occurred.
  • Trajectories are expressed as sparse 3D points attached to an object, a general, view-stable, and compact representation that downstream systems can consume.
  • The release includes the model weights, the MolmoMotion-1M dataset, and the PointMotionBench benchmark, so researchers can build on and evaluate the model directly.
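
As a rough illustration of the sparse-point representation, the sketch below stores a trajectory as a list of time steps, each holding one 3D position per tracked point, together with the action description. The `PointTrajectories` container and the `mean_displacement_error` function are assumptions made for this example: the article does not specify the dataset's actual schema or the metric PointMotionBench uses. The error shown is simply the average Euclidean distance between predicted and reference points.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class PointTrajectories:
    """Hypothetical container: positions[t][i] is point i's 3D position at time step t."""
    action: str
    positions: List[List[Point3D]]


def mean_displacement_error(pred: PointTrajectories, truth: PointTrajectories) -> float:
    """Average Euclidean distance over all time steps and points (an assumed metric)."""
    total, count = 0.0, 0
    for pred_frame, truth_frame in zip(pred.positions, truth.positions):
        for p, q in zip(pred_frame, truth_frame):
            total += math.dist(p, q)
            count += 1
    if count == 0:
        raise ValueError("trajectories contain no points to compare")
    return total / count


# Two tracked points over two time steps; the prediction misses one point by (3, 4, 0).
truth = PointTrajectories(
    action="push the cup to the right",
    positions=[[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
               [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)]],
)
pred = PointTrajectories(
    action=truth.action,
    positions=[[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
               [(3.5, 4.0, 0.0), (1.5, 0.0, 0.0)]],
)
error = mean_displacement_error(pred, truth)
```

Because each frame is only a handful of coordinates, a structure like this is far smaller than the video it describes, which is one reason a point-based representation is easy to pass to a planner or a video generator.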

Why It Matters

This is a meaningful advance for embodied AI and generative systems. By providing a language-conditioned mechanism for predicting object dynamics in 3D space, MolmoMotion goes beyond generating plausible *frames* of motion; it forecasts the *geometry* of the motion itself, and with it the physical behavior that geometry implies. For robotics, this gives planners look-ahead on how an object will move during a manipulation task. For video, it enables generation that is physically grounded and controllable. The methodology is complex, but because the dataset and benchmark are openly released, industrial and academic researchers working on real-world physical interaction can start using them right away.
