Workshop: 8th Workshop and Competition on Affective & Behavior Analysis in-the-wild
Learning Pose-aware Representations in Vision Transformers for Understanding Activities of Daily Living
Dominick Reilly · Srijita Das · Srijan Das
To meet the demands of a growing elderly population, health monitoring applications that require fine-grained understanding of Activities of Daily Living (ADL) will be crucial. In computer vision, human action recognition is strongly influenced by observable poses, and ADL recognition typically leverages pose-centric features such as human skeletons. Traditional Vision Transformer (ViT) architectures, however, process input image patches uniformly, neglecting crucial pose priors in video data. This work integrates pose priors into the training of ViTs, demonstrating their effectiveness in learning fine-grained and viewpoint-agnostic representations that improve ADL understanding. We introduce the Pose-aware Attention Block (PAAB), a plug-and-play ViT block that performs localized attention on pose regions within videos. ViTs equipped with PAAB learn pose-aware representations, enhancing performance on a diverse set of downstream tasks. Our experiments, conducted across seven datasets, reveal the efficacy of PAAB on video action understanding and robot learning tasks. We show that PAAB outperforms its respective backbone ViTs by up to 9.8% in real-world action recognition, and by up to 21.8% in multi-view robotic video alignment.
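The abstract describes PAAB as a plug-and-play block that restricts attention to pose regions of the video. The paper itself does not specify the block's internals here, so the following is only a minimal numpy sketch of the general idea under stated assumptions: patch tokens are given as a flat array, a hypothetical boolean `pose_mask` marks which patches overlap detected pose keypoints, and self-attention (with identity Q/K/V projections for brevity) is computed only among those pose tokens while all other tokens pass through unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pose_localized_attention(tokens, pose_mask):
    """Sketch of localized attention over pose-region tokens.

    tokens:    (N, D) array of patch embeddings.
    pose_mask: (N,) boolean array; True where the patch overlaps a
               pose region (assumed to come from an external pose
               estimator -- not part of this sketch).
    Returns an (N, D) array where only pose tokens are updated.
    """
    idx = np.where(pose_mask)[0]
    q = k = v = tokens[idx]                 # identity projections for brevity
    d_k = tokens.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d_k))  # attention restricted to pose tokens
    out = tokens.copy()
    out[idx] = attn @ v                     # non-pose tokens pass through
    return out

# Toy usage: 6 patch tokens of dimension 4, 3 of which lie on the pose.
tokens = np.random.randn(6, 4)
mask = np.array([True, True, False, True, False, False])
out = pose_localized_attention(tokens, mask)
```

In a real ViT, Q/K/V would be learned linear projections and the block would be inserted alongside the standard global-attention blocks; the key point illustrated is that attention weights are computed only within the pose-masked token subset.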