

Statistics of the MMHU dataset.
Example samples from MMHU: each sample contains a motion sequence rendered on the original image, behavior tags, and a text description.
We collect data from three sources: the Waymo dataset, YouTube videos, and self-collected or paid driving videos. We design a labeling pipeline that produces high-quality annotations with minimal human effort.
Current motion generation approaches cannot generate human motion in street contexts (left of each example). After fine-tuning on MMHU, they generate such motions properly (right of each example).
Fine-tuning on MMHU also improves the performance of baseline models on Motion Prediction, Intention Prediction, and Behavior VQA.
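
For illustration, below is a minimal sketch of how such samples could be loaded in PyTorch. The on-disk layout (one JSON file per sample) and the field names motion, behavior_tags, and description are assumptions based on the sample contents described above, not the released MMHU format.

# Minimal loading sketch (hypothetical layout and field names).
import json
from pathlib import Path

import torch
from torch.utils.data import Dataset


class MMHUDataset(Dataset):
    """Yields (motion sequence, behavior tags, text description) triples."""

    def __init__(self, root: str):
        # Assumed layout: one JSON file per sample under root/.
        self.files = sorted(Path(root).glob("*.json"))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int) -> dict:
        record = json.loads(self.files[idx].read_text())
        # Motion is assumed to be a (T, J, 3) array of per-frame joint positions.
        motion = torch.tensor(record["motion"], dtype=torch.float32)
        return {
            "motion": motion,                          # tensor of shape (T, J, 3)
            "behavior_tags": record["behavior_tags"],  # list[str]
            "description": record["description"],      # free-form text
        }

Wrapped in a standard torch.utils.data.DataLoader, such a dataset could feed baselines for any of the three tasks above.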
@misc{li2025mmhumassivescalemultimodalbenchmark,
  title={MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding},
  author={Renjie Li and Ruijie Ye and Mingyang Wu and Hao Frank Yang and Zhiwen Fan and Hezhen Hu and Zhengzhong Tu},
  year={2025},
  eprint={2507.12463},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.12463},
}