Occluded Human Body Capture with Frequency Domain Denoising Prior
Abstract
Monocular human motion capture in occlusion scenarios presents significant challenges. Although a few works have explicitly considered the occlusion problem, image-based methods are unreliable due to the lack of temporal constraints, while video-based approaches cannot gain sufficient knowledge from time-domain motion priors to address long-term occlusions. However, occluded human motion typically exhibits periodic patterns and consistent momentum. Inspired by this observation, we exploit reliable image observations in the frequency domain and formulate the motion capture task as a wavelet coefficient selection process. Specifically, we first construct probabilistic distributions for the occluded 2D keypoints, and then introduce a frequency-domain diffusion model that refines the distributions by learning long-term periodic information and physical momentum via the Discrete Wavelet Transform (DWT). Consequently, the learned denoising prior can select valid wavelet components to facilitate 3D motion capture with a 3D decoder. By employing a joint reprojection strategy, we can also use the same diffusion process to train the 3D decoder. To further promote research on human occlusion-related tasks, we present the first 3D occluded motion dataset, OcMotion, which serves as a new benchmark for both training and evaluation. Experimental results demonstrate that our method produces accurate and coherent human motions from occluded videos. The dataset and code will be publicly available.
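The core intuition above, that occluded motion retains periodic low-frequency structure while occlusion noise concentrates in high-frequency detail coefficients, can be illustrated with a toy single-level Haar DWT on a 1D joint trajectory. This is a simplified sketch only: the paper's method uses a learned diffusion model over DWT coefficients, whereas here the "selection" is a hard threshold, and all function names are illustrative.

```python
import math

def haar_dwt(signal):
    # Single-level Haar DWT: split a 1D trajectory into
    # low-frequency approximation and high-frequency detail coefficients.
    approx, detail = [], []
    for a, b in zip(signal[0::2], signal[1::2]):
        approx.append((a + b) / math.sqrt(2))
        detail.append((a - b) / math.sqrt(2))
    return approx, detail

def haar_idwt(approx, detail):
    # Inverse single-level Haar DWT (exact reconstruction).
    out = []
    for s, d in zip(approx, detail):
        out.append((s + d) / math.sqrt(2))
        out.append((s - d) / math.sqrt(2))
    return out

def select_coefficients(signal, threshold):
    # Toy stand-in for the learned denoising prior: keep the periodic
    # low-frequency structure, zero out small detail coefficients that
    # we attribute (by assumption) to occlusion noise.
    approx, detail = haar_dwt(signal)
    detail = [d if abs(d) > threshold else 0.0 for d in detail]
    return haar_idwt(approx, detail)

# A noisy periodic trajectory, e.g. one joint's x-coordinate over 16 frames.
traj = [math.sin(0.5 * t) + (0.05 if t % 3 == 0 else -0.05)
        for t in range(16)]
denoised = select_coefficients(traj, threshold=0.2)
```

The round trip `haar_idwt(*haar_dwt(x))` reconstructs `x` exactly, so any smoothing comes purely from which coefficients are kept, mirroring the abstract's framing of capture as coefficient selection rather than direct signal filtering.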