Workshop: 8th Workshop and Competition on Affective & Behavior Analysis in-the-wild
Leveraging Lightweight Facial Models and Textual Modality in Audio-visual Emotional Understanding in-the-Wild
Andrey Savchenko · Lyudmila Savchenko
This article presents our results for the eighth Affective Behavior Analysis in-the-Wild (ABAW) competition. We combine facial emotional descriptors extracted by lightweight pre-trained models from our EmotiEffLib library with acoustic features and embeddings of texts recognized from speech. The frame-level features are aggregated and fed into simple classifiers, e.g., a multi-layer perceptron (a feed-forward neural network with one hidden layer), to predict ambivalence/hesitancy and facial expressions. In the latter case, we also use the pre-trained facial expression recognition model to select high-score video frames and avoid processing them with a domain-specific video classifier. The video-level prediction of emotional mimicry intensity is implemented by simply aggregating frame-level features and training a multi-layer perceptron. Experimental results for four tasks from the ABAW challenge demonstrate that our approach significantly improves validation metrics over existing baselines. As a result, our solutions took first place in the expression classification and ambivalence/hesitancy recognition challenges, and third place in the emotional mimicry intensity estimation and action unit detection tasks.
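The pipeline the abstract describes, pooling frame-level descriptors into a single clip-level vector and training a one-hidden-layer classifier, can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the embedding dimension, number of classes, pooling statistics, and hidden-layer size are assumptions, and random data stands in for EmotiEffLib facial features.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def aggregate_clip(frame_embeddings):
    """Pool frame-level features into one clip-level vector (mean + std).

    The concrete pooling statistics are an assumption for illustration."""
    return np.concatenate([frame_embeddings.mean(axis=0),
                           frame_embeddings.std(axis=0)])

# Synthetic stand-in data: 200 clips with variable frame counts and
# hypothetical 256-dimensional facial embeddings per frame.
clips = [rng.normal(size=(rng.integers(10, 40), 256)) for _ in range(200)]
X = np.stack([aggregate_clip(c) for c in clips])
y = rng.integers(0, 8, size=200)  # e.g., 8 facial expression classes

# Simple classifier: a feed-forward network with one hidden layer,
# as described in the abstract (hidden size chosen arbitrarily here).
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300, random_state=0)
clf.fit(X, y)
pred = clf.predict(X[:3])
print(pred.shape)
```

The same clip-level vectors can be reused across tasks (expression classification, ambivalence/hesitancy, mimicry intensity) by swapping the label set and, for regression targets, the classifier for a regressor.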