ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation
Abstract
Pose estimation remains challenging under sparse views, especially when visual overlap across images is extremely limited. Recent advances in video generation models offer a promising solution: interpolating between keyframes can enrich contextual cues and improve pose estimation performance. However, existing video generation models often lack 3D consistency, producing temporally plausible but spatially inconsistent frames that degrade downstream pose estimation. In this paper, we propose ExPose, a framework that directly addresses 3D inconsistency when applying video generation to pose estimation in extreme-view settings. Specifically, we fine-tune a video generation model using Group Relative Policy Optimization (GRPO), aligning its outputs with 3D-consistent supervisory signals derived from pose estimation objectives. Our approach not only enhances the quality of temporal interpolation but also ensures spatial coherence across views, significantly improving pose estimation accuracy. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines, highlighting the potential of reinforcement-tuned video generation as a powerful tool for pose estimation in extreme-view scenarios.
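To make the fine-tuning signal concrete, the following is a minimal sketch of the group-relative advantage computation at the core of GRPO. All names, reward values, and the interpretation of rewards as 3D-consistency scores are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of GRPO's group-relative advantage (assumed setup, not
# the paper's implementation): for one prompt (here, a keyframe pair),
# a group of candidate videos is sampled, each scored by a reward, and
# each reward is normalized against the group's mean and std.

def group_relative_advantages(rewards, eps=1e-8):
    """Return (r - mean) / (std + eps) for each reward in the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards: e.g., a 3D-consistency score from an
# off-the-shelf pose estimator run on each generated video.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
# Candidates above the group mean receive positive advantage (their
# likelihood is reinforced); those below are suppressed.
```

Because the baseline is the group mean rather than a learned value function, this normalization needs no critic network, which is one reason GRPO is attractive for fine-tuning large generative models.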