CVPR Poster STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

Poster

STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

Xiaoyu Zhu · Po-Yao Huang · Junwei Liang · Celso M. de Melo · Alexander G. Hauptmann

West Building Exhibit Halls ABC 145

[ Abstract ] [ Project Page ]

[ Paper PDF] [ Poster]

Abstract:

We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame off-set attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT.

Chat is not available.