

LoCoNet: Long-Short Context Network for Active Speaker Detection

Xizi Wang · Feng Cheng · Gedas Bertasius

Arch 4A-E Poster #372
Thu 20 Jun 5 p.m. PDT — 6:30 p.m. PDT


Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. Solving ASD involves using audio and visual information in two complementary contexts: long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions of speakers in the same scene. Motivated by these observations, we propose LoCoNet, a simple but effective Long-Short Context Network that leverages Long-term Intra-speaker Modeling (LIM) and Short-term Inter-speaker Modeling (SIM) in an interleaved manner. LIM employs self-attention to model long-range temporal dependencies and cross-attention to model audio-visual interactions. SIM incorporates convolutional blocks that capture local patterns for short-term inter-speaker context. Experiments show that LoCoNet achieves state-of-the-art performance on multiple datasets, with 95.2% (+0.3%) mAP on AVA-ActiveSpeaker, 97.2% (+2.7%) mAP on Talkies, and 68.4% (+7.7%) mAP on Ego4D. Moreover, in challenging cases where multiple speakers are present, LoCoNet outperforms previous state-of-the-art methods by 3.0% mAP on AVA-ActiveSpeaker. The code will be released.
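The interleaving of LIM and SIM described above can be sketched in PyTorch. This is a minimal illustration, not the authors' released implementation: the module names, feature dimensions, block counts, and the exact attention/convolution layout are all assumptions made for clarity, and only the high-level structure (self-attention and audio-visual cross-attention for long-term intra-speaker modeling, convolution across the speaker axis for short-term inter-speaker modeling, applied in alternation) follows the abstract.

```python
import torch
import torch.nn as nn

class LIMBlock(nn.Module):
    """Long-term Intra-speaker Modeling (sketch): temporal self-attention
    on visual features plus cross-attention from visual queries to audio.
    Dimensions and pre-norm layout are illustrative assumptions."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, v, a):
        # v, a: (batch * speakers, time, dim) visual and audio features
        q = self.norm1(v)
        v = v + self.self_attn(q, q, q)[0]          # long-range temporal context
        v = v + self.cross_attn(self.norm2(v), a, a)[0]  # audio-visual interaction
        return v

class SIMBlock(nn.Module):
    """Short-term Inter-speaker Modeling (sketch): a small 2D convolution
    over the (speaker, time) grid captures local inter-speaker patterns."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, v, num_speakers):
        bs, t, d = v.shape
        b = bs // num_speakers
        x = v.view(b, num_speakers, t, d).permute(0, 3, 1, 2)  # (b, d, s, t)
        x = x + self.conv(x)                                    # local speaker/time mixing
        return x.permute(0, 2, 3, 1).reshape(bs, t, d)

class LoCoNetSketch(nn.Module):
    """Interleaves LIM and SIM blocks, then scores each frame per speaker."""
    def __init__(self, dim=128, num_blocks=2):
        super().__init__()
        self.lim = nn.ModuleList(LIMBlock(dim) for _ in range(num_blocks))
        self.sim = nn.ModuleList(SIMBlock(dim) for _ in range(num_blocks))
        self.head = nn.Linear(dim, 1)

    def forward(self, v, a, num_speakers):
        for lim, sim in zip(self.lim, self.sim):
            v = lim(v, a)              # long-term intra-speaker modeling
            v = sim(v, num_speakers)   # short-term inter-speaker modeling
        return self.head(v).squeeze(-1)  # (batch * speakers, time) speaking logits
```

A forward pass over, say, 2 clips with 3 candidate speakers each would take visual and audio tensors of shape `(6, T, 128)` and return per-frame speaking logits of shape `(6, T)`.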
