Introducing the first LLM-based Motion understanding model: MotionLLM

As this is a blog post, we only highlight the key features and insights of MotionLLM and discuss its fruitful applications. As a research work, we emphasize what is most essential.

Q1: Do previous methods fall short?

Yes. We answer this question in two parts.

1) The most competitive baseline for video understanding is Video-LLaVA, which we acknowledge as good work. However, it does not follow instructions well at inference time; in particular, it tends to describe environmental information when asked about human behaviors. 2) For the motion part, there is a larger gap toward "good" understanding. The largest existing model (MotionGPT, from Jiang et al.) cannot be scaled to the billion-parameter level. Besides, it lacks strong generalization ability, especially for reasoning.

These observations motivate us to develop MotionLLM. Many peers in the motion community have asked me, "Why can motion-related models not be scaled to an x-B level?" Today, MotionLLM does it.

Q2: Why are the supported modalities motion and video?

First, motion data is less redundant, free of distracting context, and much more privacy-preserving. In this setting, motion data is easier for deep models to compress. However, motion data is less grounded, especially for some "non-physical performance" examples. Fortunately, this grounding is exactly what videos can provide. That is our basic motivation.

Q3: What is our research target?

We aim to close two loops.

We analyzed these aspects when starting the project. I discussed with Shunlin why previous methods failed, and we summarize the issues in two folds below.

For the first issue, our technical solution is quite simple. Motivated by LLaVA, we bridge the modality gap via a linear projection layer. However, unlike the image-video setting of Video-LLaVA and LLaVA, motion has a larger modality gap with video; thus, motion and video do not share the projection layer. Note that both motion and video share knowledge in the LLM part, where they help each other. To make full use of "motion-text-video" triplet data, we include all of it in training and find that joint training indeed yields better results. This additionally helps answer Q2 above.
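The dual-projector design described above can be sketched as follows. This is a minimal illustration, not the released implementation: all dimensions, weight shapes, and initialization scales are assumptions for demonstration only.

```python
import numpy as np

# Hypothetical feature dimensions -- illustrative, not the paper's values.
MOTION_DIM, VIDEO_DIM, LLM_DIM = 512, 1024, 4096

rng = np.random.default_rng(0)

# Separate linear projections: motion and video do NOT share weights,
# because the motion-video modality gap is larger than the image-video one.
W_motion = rng.standard_normal((MOTION_DIM, LLM_DIM)) * 0.02
W_video = rng.standard_normal((VIDEO_DIM, LLM_DIM)) * 0.02

def project_motion(tokens: np.ndarray) -> np.ndarray:
    """Map motion tokens (seq_len, MOTION_DIM) into the LLM embedding space."""
    return tokens @ W_motion

def project_video(tokens: np.ndarray) -> np.ndarray:
    """Map video tokens (seq_len, VIDEO_DIM) into the LLM embedding space."""
    return tokens @ W_video

# Both streams land in the same LLM embedding space, so the LLM backbone
# is shared across modalities and knowledge can transfer between them.
motion_emb = project_motion(rng.standard_normal((16, MOTION_DIM)))
video_emb = project_video(rng.standard_normal((64, VIDEO_DIM)))
assert motion_emb.shape == (16, LLM_DIM)
assert video_emb.shape == (64, LLM_DIM)
```

The key design choice is that only the last dimension is unified: each modality keeps its own projector, while the downstream LLM consumes both token streams identically.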

For the second issue, we paid tens of thousands of USD to annotate the data, including captions and QAs. I believe in "no dirty work, no perfect result." We think this data will be quite useful to the community.

As for results, I will not dwell on the SOTA numbers here; instead, I want to highlight some applications of MotionLLM.

… …

For details, please visit our homepage.

* By Ling-Hao Chen and Shunlin Lu. Credit also goes to the other co-authors.