SaPaVe: Towards Active Perception and Manipulation in Vision-Language Action Models for Robots
Mengzhen Liu ⋅ Enshen Zhou ⋅ Cheng Chi ⋅ Yi Han ⋅ Shanyu Rong ⋅ Liming Chen ⋅ Pengwei Wang ⋅ Zhongyuan Wang ⋅ Shanghang Zhang
Abstract
Active perception and manipulation are crucial for embodied robots interacting with complex scenes. Existing methods struggle to unify semantically driven active perception with robust, viewpoint-invariant execution. To this end, we propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Central to our approach is the decoupling of camera and manipulation actions, in contrast to a shared action space, together with a bottom-up learning strategy: we first train semantic camera control on our proposed large-scale dataset, and then jointly optimize both action types on hybrid data. To support this learning, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for learning semantic camera movement, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We further present ActiveManip-Bench, the first benchmark designed to evaluate active manipulation. Extensive experiments in both simulation and real-world settings show that SaPaVe outperforms recent VLA models such as GR00T and $\pi_0$, achieving up to 31.25\% higher success rates on real-world tasks. Our results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enables efficient and generalizable active manipulation.