Skip to yearly menu bar Skip to main content


Poster

ES³: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations

Yuanhang Zhang · Shuang Yang · Shiguang Shan · Xilin Chen

Arch 4A-E Poster #281

Abstract: We propose a novel strategy, ES$^3$, for self-supervised learning of robust audio-visual speech representations from unlabeled talking face videos. While many recent approaches for this task primarily rely on guiding the learning process using the audio modality alone to capture information shared between audio and video, we reframe the problem as the acquisition of *shared*, *unique* (modality-specific) and *synergistic* speech information to address the inherent **asymmetry** between the modalities. Based on this formulation, we propose a novel "evolving" strategy that progressively builds joint audio-visual speech representations that are strong for both uni-modal (audio & visual) and bi-modal (audio-visual) speech. First, we leverage the more easily learnable audio modality to initialize audio and visual representations by capturing audio-unique and shared speech information. Next, we incorporate video-unique speech information and bootstrap the audio-visual representations on top of the previously acquired shared knowledge. Finally, we maximize the total audio-visual speech information, including synergistic information to obtain robust and comprehensive representations. We implement ES$^3$ as a simple Siamese framework and experiments on both English benchmarks and a newly contributed large-scale Mandarin dataset show its effectiveness. In particular, on LRS2-BBC, our smallest model is on par with SoTA models with only 1/2 parameters and 1/8 unlabeled data (223h).

Chat is not available.