

Poster

Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning

Wei Zhang · Chaoqun Wan · Tongliang Liu · Xinmei Tian · Xu Shen · Jieping Ye


Abstract:

Extending large image-text pre-trained models (e.g., CLIP) to video understanding has made significant advances. To enable CLIP to perceive dynamic information in videos, existing works equip the visual encoder with various temporal modules. However, these methods exhibit an "asymmetry" between the visual and textual sides, with neither temporal descriptions in the input texts nor temporal modules in the text encoder. This limitation hinders the potential of the language supervision emphasized in CLIP and restricts the learning of temporal features, as the text encoder has shown limited proficiency in motion understanding. To address this issue, we propose leveraging "MoTion-Enhanced Descriptions" (MoTED) to facilitate the extraction of distinctive temporal features in videos. Specifically, we first generate discriminative motion-related descriptions by querying GPT-4 to compare easily confused action categories. Then, we equip both the visual and textual encoders with additional perception modules to process the video frames and the generated descriptions, respectively. Finally, we adopt a contrastive loss to align the visual and textual motion features. Extensive experiments on five benchmarks show that MoTED surpasses state-of-the-art methods by convincing margins, laying a solid foundation for empowering CLIP with strong temporal modeling.
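For readers unfamiliar with the alignment objective mentioned above, the sketch below illustrates a CLIP-style symmetric contrastive loss between pooled visual and textual motion features. The function name, tensor shapes, and temperature value are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def motion_contrastive_loss(visual_motion_feats, text_motion_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning video and text motion features.

    Assumed (hypothetical) inputs:
      visual_motion_feats: (B, D) pooled outputs of the visual perception module
      text_motion_feats:   (B, D) pooled outputs of the textual perception module
    Matching video-description pairs share the same batch index.
    """
    # L2-normalize both modalities so dot products equal cosine similarity
    v = F.normalize(visual_motion_feats, dim=-1)
    t = F.normalize(text_motion_feats, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the positive pairs
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Average the video-to-text and text-to-video cross-entropy terms
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```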
