MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics
Abstract
Understanding animal behavior requires modeling how bodies, objects, and other agents interact over time, not simply detecting isolated actions or estimating pose frame by frame. Existing animal video datasets target pose estimation or coarse, passively observed actions, and rarely provide the structured, multi-entity interaction annotations needed to study behavioral dynamics. We introduce MooCap, a multi-view video benchmark for animal-object-human interaction understanding under controlled experimental protocols. MooCap contains 42 hours of synchronized multi-camera video from 43 individually tested cows across seven standardized interaction scenarios, including novel environment, novel object, novel human, human approach, unfamiliar conspecifics (restricted and unrestricted), and dam reunion (restricted and unrestricted). Across 157 test sessions, recordings are densely annotated with 23 fine-grained behaviors, 39 body keypoints, 4 spatial zones, and 43 subject identities, describing interactions among subjects, objects, humans, and other cattle. We establish three benchmarks on MooCap: (1) dense temporal action segmentation over sequences of 1200--1500 seconds; (2) pose-based behavior and interaction recognition from keypoint trajectories; and (3) longitudinal behavioral classification linking adult behaviors with rearing conditions. Benchmarking results reveal that state-of-the-art temporal segmentation models achieve only 66.4\% frame accuracy and 30.6\% F1@0.5, with performance degrading further during interaction-heavy segments. Overall, MooCap bridges multi-view pose estimation, multi-entity tracking, and structured behavioral protocols to enable interaction-aware models for animal behavior analysis.
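For readers unfamiliar with the reported metrics, frame accuracy is the per-frame label agreement, and F1@0.5 is the segmental F1 score at a 50\% temporal-IoU matching threshold commonly used for temporal action segmentation. The sketch below illustrates both under these standard definitions; it is not part of any MooCap toolkit, and all function names are illustrative.

```python
import numpy as np

def frame_accuracy(pred, gt):
    """Per-frame label accuracy over a sequence."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float((pred == gt).mean())

def to_segments(labels):
    """Collapse a frame-wise label sequence into (label, start, end) segments, end exclusive."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments

def segmental_f1(pred, gt, iou_threshold=0.5):
    """Segmental F1@k: a predicted segment counts as a true positive if its
    temporal IoU with a still-unmatched ground-truth segment of the same
    class meets the threshold; each ground-truth segment matches at most once."""
    pred_segs, gt_segs = to_segments(list(pred)), to_segments(list(gt))
    matched = [False] * len(gt_segs)
    tp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (glabel, gs, ge) in enumerate(gt_segs):
            if glabel != label or matched[j]:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union if union > 0 else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_threshold:
            tp += 1
            matched[best_j] = True
    fp = len(pred_segs) - tp
    fn = len(gt_segs) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```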