CrossHOI: Learning Cross-View Representations for Monocular 3D Human-Object Interaction Reconstruction
Abstract
Reconstructing 3D human-object interaction (HOI) from monocular images is highly challenging, especially when the human and object occlude each other. Existing methods rely primarily on single-view inputs, which fundamentally limit their ability to recover occluded regions and accurately estimate contact areas. To address these challenges, we introduce, for the first time, novel-view feature priors to enhance monocular 3D HOI reconstruction. We first design a cross-view generator that learns to infer novel-view image features from a single-view input, enriching spatial geometry at the feature level without requiring extra inputs at inference time. Guided by both real and generated view features, a spatial cross-view feature fusion module adaptively aggregates complementary cues to improve the initial reconstruction of human and object meshes. Building on this reconstruction, we sample 3D vertex features from both views and introduce a bidirectional cross-view Transformer that integrates multi-view vertex representations for accurate contact estimation. Finally, the predicted contact maps are leveraged to refine the human and object meshes, yielding geometrically consistent and physically plausible reconstructions. Experiments on BEHAVE and InterCap show that the proposed CrossHOI surpasses state-of-the-art methods in both reconstruction accuracy and contact prediction, especially under severe occlusion.
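To make the bidirectional cross-view design concrete, the sketch below illustrates one plausible reading of the contact-estimation stage: per-vertex features sampled from the real view and the generated novel view attend to each other via cross-attention in both directions, and the fused streams feed a per-vertex contact head. This is a minimal sketch, not the authors' released code; the module name, dimensions, single-layer design, and the `contact_head` are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BidirectionalCrossViewFusion(nn.Module):
    """Hypothetical sketch of bidirectional cross-view vertex fusion."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Real-view vertices attend to generated-view vertices, and vice versa.
        self.real_to_gen = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gen_to_real = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_real = nn.LayerNorm(dim)
        self.norm_gen = nn.LayerNorm(dim)
        # Per-vertex contact probability head (assumed architecture).
        self.contact_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
        )

    def forward(self, feat_real: torch.Tensor, feat_gen: torch.Tensor) -> torch.Tensor:
        # feat_real, feat_gen: (B, V, C) vertex features from the two views.
        upd_real, _ = self.real_to_gen(feat_real, feat_gen, feat_gen)
        upd_gen, _ = self.gen_to_real(feat_gen, feat_real, feat_real)
        feat_real = self.norm_real(feat_real + upd_real)
        feat_gen = self.norm_gen(feat_gen + upd_gen)
        # Concatenate the two fused streams and predict a per-vertex contact map.
        return self.contact_head(torch.cat([feat_real, feat_gen], dim=-1)).squeeze(-1)


# Usage: 6890 SMPL vertices with 256-d features sampled from each view.
fusion = BidirectionalCrossViewFusion(dim=256)
contact = fusion(torch.randn(2, 6890, 256), torch.randn(2, 6890, 256))  # (2, 6890)
```

The predicted contact map could then drive the mesh refinement described above, e.g., by penalizing human-object vertex distances at high-confidence contact vertices.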