CVPR Poster PoseIRM: Enhance 3D Human Pose Estimation on Unseen Camera Settings via Invariant Risk Minimization

Poster

PoseIRM: Enhance 3D Human Pose Estimation on Unseen Camera Settings via Invariant Risk Minimization

Yanlu Cai · Weizhong Zhang · Yuan Wu · Cheng Jin

Arch 4A-E Poster #189

[ Abstract ] [ Paper PDF ]

[Paper PDF]

Abstract:

Camera-parameter-free multi-view pose estimation is an emerging technique for 3D human pose estimation (HPE). They can infer the camera settings implicitly or explicitly to mitigate the depth uncertainty impact, showcasing significant potential in real applications. However, due to the limited camera setting diversity in the available datasets, the inferred camera parameters are always simply hardcoded into the model during training and not adaptable to the input in inference, making the learned models cannot generalize well under unseen camera settings. A natural solution is to artificially synthesize some samples, i.e., 2D-3D pose pairs, under massive new camera settings. Unfortunately, to prevent over-fitting the existing camera setting, the number of synthesized samples for each new camera setting should be comparable with that for the existing one, which multiplies the scale of training and even makes it computationally prohibitive. In this paper, we propose a novel HPE approach under the invariant risk minimization (IRM) paradigm. Precisely, we first synthesize 2D poses from myriad camera settings. We then train our model under the IRM paradigm, which targets at learning a common optimal model across all camera settings and thus enforces the model to automatically learn the camera parameters based on the input data. This allows the model to accurately infer 3D poses on unseen data by training on only a handful of samples from each synthesized setting and thus avoid the unbearable training cost increment. Another appealing feature of our method is that benefited from the capabilty of IRM in identifying the invariant features, its performance on the seen camera settings is enhanced as well. Comprehensive experiments on both Human3.6M and TotalCapture datasets clearly attest to the superiority of our approach.

Chat is not available.