Foca-VLA: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
Abstract
Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose Foca-VLA, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. Foca-VLA introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a cross-scale routing Mixture-of-Experts (MoE) with impedance control in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force-position regulation. To support learning and evaluation, we construct Foca-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling, with multi-view images, task prompts, proprioceptive state, and force signals. Extensive experiments show that Foca-VLA substantially improves success rates and reliability in contact-rich manipulation, outperforming Pi0 and Pi0.5 by 48.0% and 35.0%, respectively, across the 5 tasks, and mitigating common failure modes such as arm overload and unstable contact, thereby advancing force-aware physical intelligence in VLAs.
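The abstract does not spell out the control law, so the following is only a point of reference for readers unfamiliar with the terms: a minimal sketch of a textbook impedance law combined with a per-axis hybrid force-position selection. It is not Foca-VLA's action expert. The names `ImpedanceGains`, `impedance_force`, and `hybrid_command`, the gain values, and the 5 N contact target are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ImpedanceGains:
    stiffness: float  # K in N/m (illustrative value, not from the paper)
    damping: float    # D in N*s/m (illustrative value, not from the paper)


def impedance_force(gains, x_des, x, v_des, v):
    """Textbook 1-D impedance law: F = K (x_des - x) + D (v_des - v)."""
    return gains.stiffness * (x_des - x) + gains.damping * (v_des - v)


def hybrid_command(selection, pos_cmd, force_cmd):
    """Per-axis blend: selection[i] = 1 means the axis follows the
    position-derived command, 0 means it follows the force command."""
    return [s * p + (1.0 - s) * f
            for s, p, f in zip(selection, pos_cmd, force_cmd)]


gains = ImpedanceGains(stiffness=500.0, damping=20.0)

# Assumed scenario: x and y track a position target through the impedance law,
# while z regulates a hypothetical 5 N pressing force against a surface.
pos_cmd = [
    impedance_force(gains, x_des=0.10, x=0.08, v_des=0.0, v=0.0),
    impedance_force(gains, x_des=0.00, x=0.00, v_des=0.0, v=0.0),
    0.0,
]
force_cmd = [0.0, 0.0, -5.0]
command = hybrid_command([1.0, 1.0, 0.0], pos_cmd, force_cmd)
print(command)
```

In this sketch the selection vector is fixed by hand. A learned policy would instead have to decide, per stage of the task, which axes are position-controlled and which are force-controlled.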