MolmoMotion: Language-guided 3D motion forecasting

Key Takeaways

MolmoMotion is a new AI model from the Allen Institute for AI (AI2) that predicts future 3D object motion from language instructions.
It works by taking a video frame, 3D points on an object, and a text description to forecast how the object will move.
MolmoMotion is open-source, with models, datasets, and code available on Hugging Face and GitHub.
Its predicted trajectories can feed robot planning, animation, and controllable video generation.

Understanding and predicting how objects move is a core challenge in artificial intelligence. AI systems are now good at perceiving what has already happened in a video, but many applications depend on forecasting what will happen next. MolmoMotion addresses this by forecasting an object's future 3D movement from a short text instruction.

Released by the Allen Institute for AI (AI2) on June 17, 2026, MolmoMotion is not just another research paper; it's a practical, open-source tool designed to help developers and researchers build more intelligent systems. Imagine telling an AI, "Move the cup to the left," and having it accurately predict the cup's 3D trajectory. This capability has huge implications for robotics, video generation, and many other fields. This tutorial will guide you through understanding what MolmoMotion is, how it works, and how you can start exploring its potential.

What is MolmoMotion and Why Does it Matter?

MolmoMotion is a cutting-edge AI model that focuses on language-guided 3D motion forecasting. In simple terms, it predicts how an object will move in a three-dimensional space, and it takes specific instructions written in plain language to guide that prediction. Think of it as giving a command to an AI, like "Rotate the wooden bowl with fruit on the table," and the system then calculates the precise 3D path the bowl will take over the next few seconds.

The core problem MolmoMotion solves is the leap from retrospective perception (understanding past motion) to prospective forecasting (predicting future motion). Traditional computer vision excels at tracking objects that have already moved. However, many real-world applications, like a robot planning to grab a moving object or an AI generating a video with physically accurate actions, need to know what's going to happen next. MolmoMotion bridges this gap by providing a way for AI systems to "look forward."

What makes MolmoMotion particularly impactful is its ability to generalize. It's designed to be class-agnostic, meaning it's not limited to predicting the motion of specific types of objects like human bodies or rigid tools. It can handle a wide variety of objects and motions, from a lint roller on cloth to a car turning on a road. This versatility, combined with its view-stable representation of motion (consistent across different camera angles), makes it a powerful foundation for many downstream applications.

How MolmoMotion Works: The Technical Core

MolmoMotion uses Molmo 2, a multimodal model suite, as its backbone. This lets it connect visual information from images and videos with textual instructions. The process can be broken down into a few key stages:

Input and Understanding

MolmoMotion takes three main types of input for its predictions:

RGB Observation: This is typically a video frame or a short video history, providing the visual context of the scene and the object in question.
Query Points: A set of 3D points marked on the object whose motion you want to predict. These points define the object's initial position and shape within the 3D space.
Action Description: A written instruction in natural language that describes the intended action or movement of the object, such as "slide the book across the table" or "lift the box."

Using its Molmo 2 backbone, the model first processes these inputs to understand the scene. It identifies the specific object being referred to in the language instruction, grounds the query points to that object, and interprets the described motion.
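To make the three inputs concrete, here is a minimal sketch of how they might be bundled in Python. The class name ForecastRequest, its field names, and its validation rules are illustrative assumptions, not the MolmoMotion API; check the repository for the real input format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class ForecastRequest:
    """Hypothetical container for the three inputs described above."""

    frame_paths: List[str]       # RGB observation: one frame or a short history
    query_points: List[Point3D]  # 3D points marked on the object
    action: str                  # natural-language action description

    def validate(self) -> None:
        """Raise ValueError if any of the three inputs is missing or malformed."""
        if not self.frame_paths:
            raise ValueError("at least one RGB frame is required")
        if not self.query_points:
            raise ValueError("at least one query point is required")
        if any(len(p) != 3 for p in self.query_points):
            raise ValueError("each query point needs x, y, z coordinates")
        if not self.action.strip():
            raise ValueError("the action description must not be empty")


# Example values are made up for illustration.
request = ForecastRequest(
    frame_paths=["frame_000.jpg"],
    query_points=[(0.10, 0.02, 0.55), (0.14, 0.02, 0.55)],
    action="slide the book across the table",
)
request.validate()
```

Validating inputs before they reach the model is a design choice of this sketch: a missing frame or an empty instruction is cheaper to catch here than inside an inference run.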

Representing and Predicting Motion

MolmoMotion represents motion as object-attached 3D points in world space. It tracks the movement of specific points on the object instead of rendering or predicting full video frames, and forecasting a handful of points is far cheaper than forecasting every pixel. This point-based representation is also directly usable by systems that need to reason about physical motion, like robotic arms.
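A small sketch can show what such a point-based trajectory looks like in practice. It assumes a nested-list layout indexed as trajectory[t][n], which is a convenient convention for illustration and not necessarily the layout MolmoMotion uses; the helper names are hypothetical.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[List[Point3D]]  # trajectory[t][n] -> (x, y, z) in world space


def translate_over_time(points: List[Point3D], step: Point3D, num_steps: int) -> Trajectory:
    """Build a toy trajectory by shifting every point by `step` at each timestep."""
    return [
        [(x + t * step[0], y + t * step[1], z + t * step[2]) for x, y, z in points]
        for t in range(num_steps + 1)
    ]


def net_displacement(trajectory: Trajectory) -> List[Point3D]:
    """Return, for each point, its last position minus its first position."""
    first, last = trajectory[0], trajectory[-1]
    return [
        (p1[0] - p0[0], p1[1] - p0[1], p1[2] - p0[2])
        for p0, p1 in zip(first, last)
    ]


# Two points on an object sliding 1.0 unit along -x over four steps.
trajectory = translate_over_time(
    [(1.0, 0.0, 2.0), (1.0, 0.5, 2.0)], step=(-0.25, 0.0, 0.0), num_steps=4
)
displacement = net_displacement(trajectory)
```

This toy trajectory holds 5 timesteps x 2 points x 3 coordinates = 30 numbers, while a single 640x480 RGB frame holds 921,600 values, which is why point trajectories are so much lighter than predicted video.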

The model comes in two main variants, each with a slightly different approach to forecasting:

MolmoMotion-AR (Autoregressive): This variant predicts future 3D coordinates step by step, much like a language model predicts the next word in a sentence. It treats 3D coordinates as structured text. Each new coordinate prediction is conditioned on the coordinates already generated, which helps create smooth and coherent trajectories, especially when the future path is clear.
MolmoMotion-FM (Flow-matching): This variant predicts trajectories in a continuous 3D space. It uses a technique called flow-matching to transform noise into motion. This approach is particularly effective when the language instruction might allow for multiple plausible future motions, as it can better represent this uncertainty.

Both variants output 3D trajectories that are meant to match the given language instruction and the visual context.
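The difference between the two variants is easier to see in a toy example. The sketch below is not MolmoMotion code: the text format for coordinates, the stand-in toy_next_step function, and the hand-written velocity field are all assumptions chosen so the example runs without a model. In the real system a trained network would replace both stand-ins and would condition on the frame, the query points, and the instruction.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point3D = Tuple[float, float, float]


# --- Autoregressive flavour: coordinates as text, one step at a time ---

def point_to_text(p: Point3D) -> str:
    """Serialize a point as text (hypothetical format, two decimals)."""
    return " ".join(f"{c:.2f}" for c in p)


def text_to_point(s: str) -> Point3D:
    """Parse the text format above back into a 3D point."""
    x, y, z = (float(tok) for tok in s.split())
    return (x, y, z)


def rollout_autoregressive(
    start: Point3D, next_step: Callable[[List[str]], str], num_steps: int
) -> List[Point3D]:
    """Generate a path where each step is conditioned on the text so far."""
    history = [point_to_text(start)]
    for _ in range(num_steps):
        history.append(next_step(history))
    return [text_to_point(s) for s in history]


def toy_next_step(history: List[str]) -> str:
    """Stand-in for the model: move 0.25 along -x from the last point."""
    x, y, z = text_to_point(history[-1])
    return point_to_text((x - 0.25, y, z))


# --- Flow-matching flavour: integrate a velocity field from noise ---

def sample_flow(target: Sequence[float], num_steps: int, rng: random.Random) -> List[float]:
    """Euler-integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (motion)."""
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        # Toy velocity for a straight-line path toward a known target.
        # A trained model would predict this velocity instead.
        v = [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


ar_path = rollout_autoregressive((1.0, 0.0, 2.0), toy_next_step, num_steps=4)
fm_sample = sample_flow([0.0, 0.0, 2.0], num_steps=8, rng=random.Random(0))
```

Because this toy velocity field always points at one fixed target, every noise sample ends at the same place. A learned velocity field can send different noise samples to different plausible trajectories, which is how a flow-matching model can represent ambiguity in an instruction.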

Setting Up MolmoMotion: A Developer's Guide

As an open-source project, MolmoMotion is meant to be integrated, tested, and extended by researchers and developers. The Allen Institute for AI has made the model weights, dataset, and code publicly available.

Prerequisites

Before you dive in, ensure your system meets some basic requirements. While specific hardware isn't detailed, working with 3D motion forecasting models typically benefits from:

GPU: A powerful NVIDIA GPU with sufficient VRAM is highly recommended for training and even for efficient inference, as these models are computationally intensive.
Python: A recent version of Python (e.g., 3.8+) is usually required.
Deep Learning Frameworks: MolmoMotion likely relies on PyTorch or TensorFlow, given its nature as a deep learning model.
Git: For cloning the repository.
Conda or Venv: For managing your Python environment and dependencies.

Step 1: Clone the Repository

The first step is to get the MolmoMotion code onto your local machine. You can do this by cloning the official GitHub repository:

git clone https://github.com/allenai/molmo-motion.git
cd molmo-motion

The command above clones the official MolmoMotion GitHub repository.

Step 2: Set Up Your Environment and Install Dependencies

It is good practice to create a dedicated virtual environment so that MolmoMotion's dependencies do not conflict with other Python projects.

Using Conda:

conda create -n molmomotion python=3.9
conda activate molmomotion
pip install -r requirements.txt

Using Venv:

python -m venv molmomotion_env
source molmomotion_env/bin/activate  # On Windows, use `molmomotion_env\Scripts\activate`
pip install -r requirements.txt

These commands assume the cloned repository ships a requirements.txt file listing the necessary Python libraries; if it uses a different dependency file, follow the README instead. You might also need to install a specific PyTorch build that matches your GPU and CUDA setup.
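After installing, a quick sanity check can confirm that the environment is usable. The sketch below uses only the Python standard library. The function name check_environment is hypothetical, and treating torch as a required module is an assumption based on the PyTorch note above, not something the repository has confirmed here.

```python
import importlib.util
import shutil
import sys


def check_environment(min_python=(3, 8), modules=("torch",), tools=("git",)):
    """Report whether the interpreter, modules, and tools look usable."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        report[f"tool:{tool}"] = shutil.which(tool) is not None
    return report


# "json" ships with Python; "torch" is assumed to be a MolmoMotion dependency.
report = check_environment(modules=("json", "torch"))
for key, ok in report.items():
    print(f"{key}: {'ok' if ok else 'MISSING'}")
```

Run it inside the activated environment. Any line marked MISSING points at a dependency to install before moving on.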

Step 3: Download Pre-trained Models and Datasets

To use MolmoMotion effectively, you'll need its pre-trained model weights. AI2 has made these available on Hugging Face. You'll also likely want access to the MolmoMotion-1M dataset and the PointMotionBench benchmark for testing and further research.

Models: The MolmoMotion model collection is hosted on Hugging Face. Follow the instructions on the Hugging Face page or in the GitHub repository to download the weights for the variant you want (e.g., MolmoMotion-AR or MolmoMotion-FM).
Dataset: The MolmoMotion-1M dataset, which you will need for training or fine-tuning, is also available on Hugging Face.
Benchmark: The PointMotionBench benchmark, for evaluating your own implementations or understanding the model's performance, is available through the MolmoMotion project page.

Typically, downloaded model weights should be placed in a designated directory within your cloned repository, as specified in the project's documentation.
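If you prefer to script the download, the huggingface_hub library's snapshot_download function can fetch a whole repository. The sketch below is a hedged example: the repo id is a deliberate placeholder (the real names are listed in the Hugging Face collection), and the checkpoints directory and both helper functions are assumptions of this example, not project conventions.

```python
from pathlib import Path


def build_download_plan(repo_id: str, repo_type: str, root: str = "checkpoints") -> dict:
    """Return keyword arguments for huggingface_hub.snapshot_download."""
    if repo_type not in {"model", "dataset"}:
        raise ValueError("repo_type must be 'model' or 'dataset'")
    # Store each repository in its own folder, named after the repo.
    local_dir = Path(root) / repo_id.split("/")[-1]
    return {"repo_id": repo_id, "repo_type": repo_type, "local_dir": str(local_dir)}


def download(plan: dict) -> str:
    """Download the repository described by `plan` and return its local path."""
    # Imported lazily so the planning step works without the library installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(**plan)


# Placeholder repo id: replace it with the real name from the collection.
plan = build_download_plan("allenai/REPLACE-WITH-MODEL-REPO", "model")
# download(plan)  # uncomment once the repo id is filled in
```

Pass repo_type="dataset" to fetch MolmoMotion-1M the same way, again with the real repo id substituted.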

Step 4: Running Inference with MolmoMotion

Once everything is set up, you can start using MolmoMotion to forecast 3D motion. The exact command will depend on the specific scripts provided in the GitHub repository, but the general workflow usually involves:

Preparing Input: You'll need an RGB observation (e.g., an image file or a video clip), a definition of the 3D query points on the object (often provided as a file or programmatically), and your natural language action description.
Executing the Inference Script: The repository will likely contain an example script for running predictions. This script will take your inputs and the path to the downloaded model weights.

Here's a hypothetical example of what a command might look like (refer to the official GitHub for actual usage):

python run_prediction.py \
    --model_path /path/to/your/molmomotion_ar_weights.pth \
    --image_path /path/to/your/input_image.jpg \
    --query_points_file /path/to/your/object_points.json \
    --action_description "move the red ball to the left" \
    --output_trajectory_file predicted_trajectory.json

The output would typically be a file (e.g., JSON or CSV) containing the predicted 3D coordinates of the query points over time, representing the object's future trajectory. You would then need visualization tools to render this 3D trajectory.
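The file handling around such a command can be sketched in a few lines of Python. Both JSON schemas here, a query_points list on the way in and a trajectory list indexed by timestep and point on the way out, are guesses made for illustration. The real formats are whatever the repository's scripts read and write.

```python
import json
import tempfile
from pathlib import Path


def write_query_points(path, points):
    """Save query points as JSON (schema is a guess, not the official one)."""
    payload = {"query_points": [list(p) for p in points]}
    Path(path).write_text(json.dumps(payload, indent=2))


def load_trajectory(path):
    """Load a predicted trajectory as trajectory[t][n] -> (x, y, z)."""
    frames = json.loads(Path(path).read_text())["trajectory"]
    if not frames:
        raise ValueError("trajectory is empty")
    num_points = len(frames[0])
    for t, frame in enumerate(frames):
        # Every timestep must track the same points, each with x, y, z.
        if len(frame) != num_points or any(len(p) != 3 for p in frame):
            raise ValueError(f"malformed frame at timestep {t}")
    return [[tuple(p) for p in frame] for frame in frames]


# Round trip in a temporary directory, with a hand-written "prediction".
with tempfile.TemporaryDirectory() as tmp:
    points_file = Path(tmp) / "object_points.json"
    write_query_points(points_file, [(0.1, 0.2, 0.3)])
    saved_points = json.loads(points_file.read_text())

    trajectory_file = Path(tmp) / "predicted_trajectory.json"
    trajectory_file.write_text(
        json.dumps({"trajectory": [[[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]]]})
    )
    loaded = load_trajectory(trajectory_file)
```

Checking the shape of the output as soon as it is loaded keeps malformed files from causing confusing errors later in a visualization or planning step.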

Example Use Cases for Developers

Robotics Simulation: Feed MolmoMotion's predictions into a robotics simulator to test how a robot should respond to an object moving as described in language. For example, forecast "the box slides forward" and then program a robot arm to intercept it. In simulation, MolmoMotion reached a 76.3% success rate on pick-and-place tasks, compared to 56.0% for a Molmo 2 baseline, a gain of 20.3 percentage points.
Controllable Video Generation: Use the predicted 3D trajectories to guide video generation models, ensuring that generated video content accurately follows specific linguistic instructions for object movement. This can lead to more physically plausible and controllable AI-generated videos.
Interactive VR/AR Environments: Build applications where users verbally instruct virtual objects to move and MolmoMotion forecasts the resulting motion. Whether this works in real time depends on inference latency on your hardware.
Animation Tools: Assist animators by generating initial 3D motion paths for objects based on text descriptions, speeding up the animation workflow.
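As one illustration of how a downstream system might consume a forecast, the sketch below finds the first timestep at which an object's predicted centroid comes within reach of a gripper. The trajectory layout (trajectory[t][n]), the helper names, and the reach threshold are all assumptions of this example; a real planner would also account for timing, collisions, and grasp pose.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def centroid(points: Sequence[Point3D]) -> Point3D:
    """Average the tracked points to get one position for the object."""
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def first_reachable_step(
    trajectory: List[List[Point3D]], gripper: Point3D, reach: float
) -> Optional[int]:
    """Return the first timestep whose centroid lies within `reach` of the gripper."""
    for t, frame in enumerate(trajectory):
        if math.dist(centroid(frame), gripper) <= reach:
            return t
    return None  # the object never comes within reach


# Toy forecast: two points sliding along +x by 0.5 per step.
forecast = [[(0.5 * t, 0.0, 0.0), (0.5 * t, 1.0, 0.0)] for t in range(5)]
step = first_reachable_step(forecast, gripper=(2.0, 0.5, 0.0), reach=0.6)
```

In this toy forecast the centroid is 2.0 units from the gripper at the first timestep and closes by 0.5 per step, so it first falls inside the 0.6 reach at timestep 3.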

Limitations to Keep in Mind

MolmoMotion is a significant step forward, but it has limitations. The model uses eight query points per object during training. Eight points are enough for many useful trajectories, but they cannot densely represent complex deformable motion, so intricate deformations of soft or pliable objects may still be forecast poorly.
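If you start from a dense set of 3D points on an object, you need some way to reduce it to eight query points. The source does not say how MolmoMotion's query points are chosen, so the sketch below uses farthest point sampling, a common general-purpose technique, purely as an assumption. It expects distinct input points.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def farthest_point_sample(points: Sequence[Point3D], k: int = 8, start: int = 0) -> List[Point3D]:
    """Pick k well-spread points by repeatedly taking the farthest remaining one."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # min_dist[i] = distance from point i to its nearest already-chosen point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [
            min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)
        ]
    return [points[i] for i in chosen]


# Dense toy "object": a 3 x 3 x 3 grid of 27 points.
dense = [(float(x), float(y), float(z)) for x in range(3) for y in range(3) for z in range(3)]
query_points = farthest_point_sample(dense, k=8)
```

Spreading the eight points across the object helps them capture its overall pose, but it does not remove the limitation described above: eight points still cannot describe fine-grained deformation.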

Conclusion

MolmoMotion represents an exciting leap in AI's ability to understand and predict the physical world. By combining visual perception with natural language instructions, it lets developers and researchers build systems that anticipate future events, leading to more intelligent robots, more realistic AI-generated content, and more intuitive human-computer interactions. The open release of its models, data, and code reflects AI2's commitment to advancing the field and invites the community to build on this foundation.

If you're working in robotics, computer vision, animation, or any field that benefits from predictive motion, MolmoMotion offers a robust and flexible tool to explore. We encourage you to download the weights, explore the dataset, and experiment with its capabilities to see how it can enhance your projects.

Frequently Asked Questions

What is the main purpose of MolmoMotion?

MolmoMotion's main purpose is to predict the future 3D motion of objects in a scene, guided by natural language instructions. It helps AI systems anticipate how objects will move rather than just observing past movements.

Who developed MolmoMotion and when was it released?

MolmoMotion was developed by the Allen Institute for AI (AI2) and was released on June 17, 2026.

Is MolmoMotion an open-source project?

Yes, MolmoMotion is an open-source project. The model weights, the MolmoMotion-1M dataset, and the PointMotionBench benchmark, along with the code, are openly released for the community.

What are the primary applications for MolmoMotion?

MolmoMotion has primary applications in robotics planning (allowing robots to anticipate object movements), controllable video generation (creating videos where objects move according to specific instructions), and potentially in interactive VR/AR environments and animation tools.