

Modality-Collaborative Test-Time Adaptation for Action Recognition

Baochen Xiong · Xiaoshan Yang · Yaguang Song · Yaowei Wang · Changsheng Xu

Arch 4A-E Poster #249
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


Video-based Unsupervised Domain Adaptation (VUDA) methods improve the generalization of video models, enabling them to be applied to action recognition tasks in different environments. However, these methods require continuous access to source data during adaptation, which is impractical in real scenarios where the source videos are unavailable because of transmission-efficiency or privacy concerns. To address this problem, we propose to tackle the Multimodal Video Test-Time Adaptation (MVTTA) task. Existing image-based TTA methods cannot be directly applied to this task because videos exhibit domain shift in both the multimodal and temporal dimensions, which makes adaptation difficult. To address these challenges, we propose a Modality-Collaborative Test-Time Adaptation (MC-TTA) network. We maintain teacher and student memory banks for generating pseudo-prototypes and target-prototypes, respectively. In the teacher model, we propose a Self-assembled Source-friendly Feature Reconstruction (SSFR) module to encourage the teacher memory bank to store features that are more likely to be consistent with the source distribution. Through multimodal prototype alignment and cross-modal relative consistency, our method effectively alleviates domain shift in videos. We evaluate the proposed model on four public video datasets. The results show that it outperforms existing state-of-the-art methods.
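
As a rough illustration of the memory-bank and prototype idea mentioned in the abstract, the sketch below stores a bounded number of feature vectors per pseudo-label, averages them into class prototypes, and scores how well teacher and student prototypes agree. It is not the authors' implementation: the `MemoryBank` class, the `capacity_per_class` parameter, the `alignment_loss` function, and the choice of mean prototypes with a cosine-based loss are all assumptions made for this example, and the SSFR module and cross-modal relative consistency are not modeled.

```python
import math
from collections import defaultdict, deque


class MemoryBank:
    """Hypothetical bank: keeps the most recent features for each pseudo-label."""

    def __init__(self, capacity_per_class=4):
        # Each class gets a bounded queue; the oldest features are evicted first.
        self.slots = defaultdict(lambda: deque(maxlen=capacity_per_class))

    def add(self, label, feature):
        self.slots[label].append(list(feature))

    def prototypes(self):
        """Return {label: mean feature vector} for every class seen so far."""
        protos = {}
        for label, feats in self.slots.items():
            dim = len(feats[0])
            protos[label] = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
        return protos


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def alignment_loss(teacher_protos, student_protos):
    """Assumed loss: mean (1 - cosine similarity) over classes in both banks."""
    shared = sorted(set(teacher_protos) & set(student_protos))
    if not shared:
        return 0.0
    total = sum(1.0 - cosine(teacher_protos[c], student_protos[c]) for c in shared)
    return total / len(shared)
```

In a multimodal setting, one such pair of banks could be kept per modality (for example RGB and optical flow) and the per-modality losses summed; that arrangement is likewise an assumption rather than something stated in the abstract.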
