MGDHand: Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D Hand Pose Estimation from Sparse IMUs
Abstract
3D hand pose estimation (HPE) from sparse inertial measurement units (IMUs) has shown great potential for human-computer interaction. However, due to the significant semantic gap between sparse local motion information and structured global pose information, estimating hand poses from sparse IMU signals is ambiguous and challenging. Knowledge distillation can transfer rich knowledge from a stronger teacher to a student, thereby enhancing the student's performance. Existing approaches distill morphological priors into the IMU-based student model, effectively improving its accuracy in complex scenarios. Nevertheless, they overlook the inherent visual-inertial semantic mismatch and the difference in information density, which makes it difficult for the student to learn the coupled priors. In this paper, we propose a \textbf{M}ulti-\textbf{G}ranularity Prior-to-Inertial \textbf{D}istillation Framework for Sequential 3D \textbf{H}PE from Sparse IMUs (\textbf{MGDHand}). We first pre-train a MANO-IMU fusion model as a teacher to encode a static geometric morphology prior, a dynamic kinematic prior, and a temporal motion prior. We then propose a \textbf{M}ulti-\textbf{G}ranularity Decoupled \textbf{Distill}ation (\textbf{MGDistill}) scheme to bridge the semantic gap. MGDistill includes a \textbf{Static Shape Distillation} module that transfers time-invariant hand shape priors and a \textbf{Dynamic Pose Distillation} module that transfers complex joint kinematics and dense pose priors. Additionally, a \textbf{Temporal Motion Distillation} module transfers the fast-changing motion priors (velocity and acceleration). Extensive experiments on a public dataset demonstrate that our method outperforms state-of-the-art approaches under sparse IMU configurations.
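To make the three-granularity objective concrete, the sketch below shows one plausible form of the combined distillation loss implied by the abstract: a static term on time-invariant shape embeddings, a dynamic term on per-frame pose outputs, and a temporal term on velocity and acceleration taken as first and second temporal differences. This is a minimal PyTorch-style sketch, not the paper's implementation: the function name \texttt{mgd\_loss}, the dictionary layout, the MSE loss form, and the weights are all assumptions for illustration.

\begin{verbatim}
import torch
import torch.nn.functional as F

def mgd_loss(student, teacher, lambdas=(1.0, 1.0, 1.0)):
    """Hypothetical multi-granularity distillation objective.

    `student` / `teacher` are assumed dicts with:
      'shape': (B, D)        time-invariant hand shape embedding
      'pose':  (B, T, J, 3)  per-frame joint positions (T frames, J joints)
    """
    l_s, l_p, l_m = lambdas

    # Static shape distillation: match time-invariant shape embeddings.
    loss_static = F.mse_loss(student['shape'], teacher['shape'].detach())

    # Dynamic pose distillation: match dense per-frame pose predictions.
    loss_pose = F.mse_loss(student['pose'], teacher['pose'].detach())

    # Temporal motion distillation: velocity (1st difference) and
    # acceleration (2nd difference) along the time axis.
    def tdiff(x, n):
        for _ in range(n):
            x = x[:, 1:] - x[:, :-1]
        return x

    loss_motion = (
        F.mse_loss(tdiff(student['pose'], 1), tdiff(teacher['pose'], 1).detach())
        + F.mse_loss(tdiff(student['pose'], 2), tdiff(teacher['pose'], 2).detach())
    )

    return l_s * loss_static + l_p * loss_pose + l_m * loss_motion
\end{verbatim}

Detaching the teacher tensors keeps the pre-trained priors fixed so gradients flow only into the IMU-based student, consistent with the teacher being trained in a separate first stage.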