DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
Abstract
Articulated object pose estimation is a core task in embodied AI and computer vision. Existing methods typically regress poses in a continuous space, but they often struggle to 1) navigate a large, complex search space and 2) incorporate intrinsic kinematic constraints. In this paper, we introduce DICArt (DIsCrete Diffusion for Articulated Object Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the ground-truth pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy that estimates the pose of each rigid part hierarchically to respect the object's kinematic structure. We validate DICArt on both synthetic and real-world datasets with multi-hinged articulated objects. Experimental results demonstrate its superior performance and robustness over state-of-the-art baselines. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments. Code will be publicly available upon acceptance.
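To make the discrete formulation concrete, the following is a minimal illustrative sketch, not the DICArt implementation. The abstract does not specify the tokenization scheme, the noise distribution, or how the flow decider is computed, so everything here is an assumption: uniform binning of pose parameters into tokens, uniform noise over bins, and a boolean `reset_mask` standing in for the decider's output. The function names `quantize`, `dequantize`, and `denoise_or_reset` are hypothetical.

```python
import numpy as np


def quantize(values, low, high, num_bins):
    """Map continuous values in [low, high] to integer bin indices (tokens)."""
    values = np.clip(np.asarray(values, dtype=float), low, high)
    scaled = (values - low) / (high - low) * num_bins
    # The upper bound `high` would land in bin `num_bins`; fold it into the last bin.
    return np.minimum(scaled.astype(int), num_bins - 1)


def dequantize(tokens, low, high, num_bins):
    """Map tokens back to the centre of their bin."""
    width = (high - low) / num_bins
    return low + (np.asarray(tokens) + 0.5) * width


def denoise_or_reset(predicted, reset_mask, num_bins, rng):
    """One toy reverse step: keep predicted tokens, or reset masked ones to noise.

    `reset_mask` is a hypothetical stand-in for a per-token decider output;
    tokens marked True are resampled uniformly over the bins (assumed noise model).
    """
    noise = rng.integers(0, num_bins, size=predicted.shape)
    return np.where(reset_mask, noise, predicted)


# Hypothetical example: three Euler angles in degrees, one bin per degree.
pose = np.array([0.0, 90.0, -180.0])
tokens = quantize(pose, -180.0, 180.0, 360)
recovered = dequantize(tokens, -180.0, 180.0, 360)

rng = np.random.default_rng(0)
mask = np.array([False, True, False])  # reset only the second token
stepped = denoise_or_reset(tokens, mask, 360, rng)
```

Under these assumptions the quantization error is bounded by half a bin width, which is the basic trade-off between the size of the discrete state space and pose resolution.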