EgoRoC: Towards Egocentric Robotic Control via Task-Agnostic Visual Alignment
Abstract
Recent Vision-Language-Action (VLA) models map visual-textual inputs to robotic actions through end-to-end architectures, but this design entangles visual understanding with task-specific actions. As a result, training demands exhaustive collection of full operational sequences and incurs parameter redundancy across tasks, while generic third-person camera setups must be re-fine-tuned for each hardware platform because of implicit hand-eye assumptions. We argue that decoupling \textbf{how robots see} from \textbf{how robots act} is a missing primitive in VLA systems. We present \textbf{EgoRoC}, a plug-and-play egocentric alignment head that precedes any task policy and exposes only a thin 6-DoF pose interface. EgoRoC first establishes task-agnostic viewpoint consistency from a wrist-mounted (first-person) camera and then alternates alignment with manipulation, while a diffusion-based online hand–eye module corrects actions in the end-effector frame for hardware-agnostic deployment. Trained once on static wrist–target image pairs with relative poses, rather than on full manipulation trajectories, EgoRoC leaves downstream VLAs unchanged. By turning egocentric alignment into a reusable capability, EgoRoC reduces training redundancy, strengthens zero-shot cross-scene transfer, and scales across VLA backbones without manual calibration. Across simulation and real-world settings, attaching EgoRoC consistently boosts success rates, especially on long-horizon and out-of-distribution tasks, and improves data efficiency during fine-tuning.
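The "thin 6-DoF pose interface" can be pictured with a minimal Python sketch; all names below (EgoAlignHead, HandEyeCorrector, control_step, and the robot methods) are hypothetical illustrations under assumed semantics, not the released API. The point is only the control flow: a task-agnostic alignment step runs before an unchanged task policy, and a hand-eye correction maps the policy's action into the end-effector frame.

```python
import numpy as np


class EgoAlignHead:
    """Task-agnostic alignment head: wrist-camera image -> relative 6-DoF pose."""

    def align(self, wrist_image: np.ndarray) -> np.ndarray:
        # Placeholder: a trained head would predict (x, y, z, roll, pitch, yaw)
        # that restores the egocentric viewpoint it was trained on.
        return np.zeros(6)


class HandEyeCorrector:
    """Stub for an online hand-eye module re-expressing actions in the end-effector frame."""

    def to_ee_frame(self, action: np.ndarray) -> np.ndarray:
        return action  # identity placeholder for the learned correction


def control_step(head, policy, corrector, robot, wrist_image, obs):
    # 1) Align: move by the predicted relative pose; the downstream policy is untouched.
    robot.move_relative(head.align(wrist_image))
    # 2) Act: query the unchanged task policy (e.g., any VLA backbone).
    action = policy(obs)
    # 3) Correct: express the action in the end-effector frame before execution.
    robot.execute(corrector.to_ee_frame(action))
```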