XR-Poser: Accurate Egocentric Human Motion Estimation for AR/VR
Abstract
Egocentric 3D human motion estimation is essential for AR/VR experiences, yet it remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present XR-Poser, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables training on large unlabeled datasets. The proposed model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, and causal temporal attention, and supports both keypoint and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training: a teacher–student scheme generates pseudo-labels and guides training with uncertainty distillation, enabling the model to generalize across environments. On the EgoBody3M benchmark, XR-Poser outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy and reduces temporal jitter by 22.2% and 51.7%, respectively. Our auto-labeling system further reduces wrist MPJPE by 13.1%.
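To make the temporal component concrete, below is a minimal sketch of causal temporal attention over per-frame pose tokens, where each frame attends only to itself and past frames. All names and dimensions (CausalTemporalBlock, d_model, the residual layout) are illustrative assumptions, not XR-Poser's actual architecture.

```python
# Minimal sketch of causal temporal attention over per-frame pose tokens.
# Names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class CausalTemporalBlock(nn.Module):
    """Self-attention over time where frame t attends only to frames <= t,
    keeping inference online-capable while smoothing per-frame estimates."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) sequence of per-frame pose tokens.
        t = x.size(1)
        # Additive mask: -inf above the diagonal blocks attention to future frames.
        causal_mask = torch.triu(
            torch.full((t, t), float("-inf"), device=x.device), diagonal=1
        )
        out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        return self.norm(x + out)  # residual connection, then normalize


if __name__ == "__main__":
    block = CausalTemporalBlock()
    tokens = torch.randn(2, 30, 256)  # 2 clips, 30 frames each
    print(block(tokens).shape)        # torch.Size([2, 30, 256])
```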
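For the auto-labeling side, one common way to realize uncertainty-aware pseudo-label training is a heteroscedastic weighting that down-weights joints the teacher is unsure about. The sketch below assumes the teacher emits per-joint log-variances; the exact loss used by the paper's system is not specified here, so this is an assumed instantiation.

```python
# Minimal sketch of an uncertainty-weighted pseudo-label loss for
# teacher-student training. The heteroscedastic weighting is an assumption,
# not necessarily the paper's exact uncertainty-distillation objective.
import torch


def pseudo_label_loss(
    student_joints: torch.Tensor,   # (B, J, 3) student 3D joint predictions
    teacher_joints: torch.Tensor,   # (B, J, 3) teacher pseudo-labels
    teacher_log_var: torch.Tensor,  # (B, J) teacher per-joint log-variance
) -> torch.Tensor:
    """Scale each joint's error by the teacher's confidence, so noisy
    pseudo-labels contribute less to the student's gradient."""
    # Squared Euclidean error per joint; detach teacher outputs so gradients
    # flow only into the student.
    err = (student_joints - teacher_joints.detach()).pow(2).sum(dim=-1)  # (B, J)
    precision = torch.exp(-teacher_log_var.detach())                    # (B, J)
    return (precision * err).mean()
```

In this formulation, a joint with high teacher variance gets near-zero precision and is effectively ignored, which is one way uncertain pseudo-labels on unlabeled frames can be prevented from corrupting training.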