Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
Abstract
Video Diffusion Transformers (DiTs) synthesize high-quality videos with high fidelity to text descriptions involving motion. However, our understanding of how Video DiTs translate motion words into video remains limited. Furthermore, prior studies on interpretable saliency maps primarily target objects, leaving the behavior of Video DiTs with respect to motion largely unexamined. In this paper, we investigate concrete motion features that specify which object moves, and when, for a given motion concept. First, for spatial localization, we introduce GramCol, which adaptively renders per-frame saliency maps for any text concept, both motion and non-motion. Second, we propose an automatic motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion both spatially and temporally. Our methods produce concept saliency maps without any gradient-based training or additional parameters. Experimentally, our methods achieve strong performance on motion localization and zero-shot video semantic segmentation, providing clearer and more interpretable saliency maps for both motion and non-motion concepts.