

Poster

Language-Guided Audio-Visual Learning for Long-Term Sports Assessment

Huangbiao Xu · Xiao Ke · Huanqi Wu · Rui Xu · Yuezhou Li · Wenzhong Guo


Abstract:

Long-term sports assessment is a challenging video-understanding task, since it requires judging complex movement variations as well as action-music coordination. However, there is no direct correlation between the diverse background music and the movements in sporting events, so previous works rely on large numbers of model parameters to learn potential associations between actions and music. To address this issue, we propose a language-guided audio-visual learning (MLAVL) framework that models audio-action-visual correlations under the guidance of the low-cost language modality. In our framework, multidimensional domain-based actions form action knowledge graphs that motivate the audio and visual modalities to focus on task-relevant actions. We further design a shared-specific context encoder to integrate deep multimodal semantics, and an audio-visual cross-modal fusion module to evaluate action-music consistency. To match sport-specific scoring rules, we then propose a dual-branch prompt-guided grading module that weighs both visual and audio-visual performance. Extensive experiments demonstrate that our approach achieves state-of-the-art results on four public long-term sports benchmarks while keeping the parameter count low. Our code will be made available.
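The abstract only sketches the audio-visual cross-modal fusion module at a high level and gives no implementation details. The PyTorch snippet below is therefore a purely illustrative sketch, not the authors' method: the module name CrossModalFusion, the dimensions, and the language-augmented cross-attention scheme are all assumptions made for illustration. It shows one plausible reading of "audio-visual fusion guided by language": each modality cross-attends to the other modality's tokens concatenated with text embeddings.

```python
# Illustrative sketch only; all names, shapes, and the fusion scheme are
# assumptions, since the abstract does not specify the implementation.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Hypothetical audio-visual fusion: each modality attends to the other,
    with language (text) embeddings injected as extra key/value context."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.v2a_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, visual, audio, text):
        # visual: (B, Tv, D), audio: (B, Ta, D), text: (B, Tt, D)
        ctx_a = torch.cat([audio, text], dim=1)   # language-augmented audio context
        ctx_v = torch.cat([visual, text], dim=1)  # language-augmented visual context
        v2a, _ = self.v2a_attn(visual, ctx_a, ctx_a)  # visual queries -> audio+text
        a2v, _ = self.a2v_attn(audio, ctx_v, ctx_v)   # audio queries -> visual+text
        # Residual connections, then concatenate along the token axis.
        fused = torch.cat([self.norm_v(visual + v2a),
                           self.norm_a(audio + a2v)], dim=1)
        return fused  # (B, Tv + Ta, D) fused audio-visual tokens


# Minimal usage check; the fused tokens could feed a downstream grading head.
B, Tv, Ta, Tt, D = 2, 64, 64, 8, 256
fuse = CrossModalFusion(dim=D)
out = fuse(torch.randn(B, Tv, D), torch.randn(B, Ta, D), torch.randn(B, Tt, D))
print(out.shape)  # torch.Size([2, 128, 256])
```

Under this reading, action-music consistency would be judged from the fused tokens, while a separate visual-only branch would score movement quality; how the paper actually combines the two branches is specified only as a dual-branch prompt-guided grading module.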
