

Oral Session

Oral Session 4B: Embodied Computer Vision

Sat 14 Jun 11 a.m. PDT — 12:15 p.m. PDT

Sat 14 June 11:00 - 11:15 PDT

PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation

Jingyi Tian · Le Wang · Sanping Zhou · Sen Wang · Jiayi Li · Haowen Sun · Wei Tang

Robotic manipulation based on visual observations and natural language instructions is a long-standing challenge in robotics. Prevailing approaches model the action distribution with either explicit or implicit representations, which often struggle to balance accuracy and efficiency. In response, we propose PDFactor, a novel framework that models the action distribution with a hybrid triplane representation. In particular, PDFactor decomposes the 3D point cloud into three orthogonal feature planes and leverages a tri-perspective view transformer to produce dense cubic features as a latent diffusion field aligned with the observation space, representing the 6-DoF action probability distribution at arbitrary locations. We employ a small denoising network that serves conceptually as both a parameterized loss function measuring the quality of the learned latent features and an action gradient decoder that samples actions from the latent diffusion field during inference. This design lets PDFactor benefit from the spatial awareness of explicit representations and the arbitrary resolution of implicit representations, yielding strong manipulation accuracy, inference efficiency, and model scalability. Experiments demonstrate that PDFactor outperforms state-of-the-art approaches across a diverse range of manipulation tasks in RLBench simulation. Moreover, PDFactor can effectively learn multi-task policies from a limited number of human demonstrations, achieving promising accuracy on a variety of real-world manipulation tasks.
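
As a rough illustration of the tri-perspective (triplane) idea described above, the sketch below queries per-point features by projecting a 3D point onto three orthogonal feature planes and aggregating bilinearly sampled features. The plane resolution, feature dimension, and sum aggregation are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of triplane feature querying (illustrative, not PDFactor's code).
import torch
import torch.nn.functional as F

def query_triplane(planes, points):
    """Query a tri-perspective (triplane) field at arbitrary 3D points.

    planes: dict with 'xy', 'xz', 'yz' tensors of shape (1, C, H, W)
    points: (N, 3) coordinates normalized to [-1, 1]
    returns: (N, C) per-point features
    """
    feats = []
    for name, idx in (("xy", [0, 1]), ("xz", [0, 2]), ("yz", [1, 2])):
        # Take the two coordinates that index this plane.
        uv = points[:, idx].view(1, 1, -1, 2)            # (1, 1, N, 2)
        sampled = F.grid_sample(planes[name], uv,
                                mode="bilinear", align_corners=True)
        feats.append(sampled.squeeze(0).squeeze(1).t())  # (N, C)
    # Aggregate the three plane features (sum here; concatenation also works).
    return torch.stack(feats, dim=0).sum(dim=0)

# Toy usage: 64x64 planes with 32-dim features, queried at 5 random points.
planes = {k: torch.randn(1, 32, 64, 64) for k in ("xy", "xz", "yz")}
pts = torch.rand(5, 3) * 2 - 1
print(query_triplane(planes, pts).shape)  # torch.Size([5, 32])
```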

Sat 14 June 11:15 - 11:30 PDT

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Chan Hee Song · Valts Blukis · Jonathan Tremblay · Stephen Tyree · Yu Su · Stan Birchfield

Spatial understanding is a crucial capability for robots to make grounded decisions based on their environment. This foundational skill enables robots not only to perceive their surroundings but also to reason about and interact meaningfully with the world. In modern robotics, these capabilities are increasingly handled by vision-language models, which face significant challenges in spatial reasoning because their training data come from general-purpose image datasets that lack sophisticated spatial scene understanding. For example, these datasets do not address reference frame comprehension: interpreting a spatial relationship requires knowing whether it is expressed from an ego-centric, object-centric, or world-centric perspective, which is essential for effective real-world interaction. To address this issue, we introduce RoboSpatial, a large-scale spatial understanding dataset consisting of real indoor and tabletop scenes captured as 3D scans and ego-centric images, annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5K 3D scans, and 3M annotated spatial relationships, with paired 2D egocentric images and 3D scans that make it ready for both 2D and 3D use. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.
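
To make the kind of annotation described above concrete, here is a hypothetical sketch of a single spatial-relationship record and its conversion into a question-answer pair for training a vision-language model. All field names and the relation/frame vocabularies are illustrative guesses, not the released RoboSpatial schema.

```python
# Hypothetical annotation record (field names are illustrative placeholders).
from dataclasses import dataclass
from typing import Literal, Tuple

@dataclass
class SpatialRelation:
    image_id: str                                   # paired egocentric image
    scan_id: str                                    # paired 3D scan
    subject: str                                    # e.g. "mug"
    obj: str                                        # e.g. "laptop"
    relation: Literal["left_of", "right_of", "in_front_of", "behind", "on", "above"]
    frame: Literal["ego-centric", "object-centric", "world-centric"]
    subject_bbox_2d: Tuple[float, float, float, float]   # x1, y1, x2, y2 in pixels

def to_qa(r: SpatialRelation) -> Tuple[str, str]:
    """Turn one annotation into a VLM training question/answer pair."""
    q = (f"From a {r.frame} perspective, is the {r.subject} "
         f"{r.relation.replace('_', ' ')} the {r.obj}?")
    return q, "yes"

example = SpatialRelation("img_0001", "scan_0001", "mug", "laptop",
                          "left_of", "ego-centric", (120.0, 80.0, 200.0, 190.0))
print(to_qa(example))
```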

Sat 14 June 11:30 - 11:45 PDT

GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

Jieming Cui · Tengyu Liu · Ziyu Meng · Jiale Yu · Ran Song · Wei Zhang · Yixin Zhu · Siyuan Huang

Learning open-vocabulary physical skills for simulated agents remains challenging due to the limitations of reinforcement learning approaches: manually designed rewards lack scalability, while demonstration-based methods struggle to cover arbitrary tasks. We propose GROVE, a generalized reward framework for open-vocabulary physical skill learning without manual reward design or task-specific demonstrations. GROVE uniquely combines Large Language Models (LLMs) for generating precise constraints with Vision Language Models (VLMs) for semantic evaluation. Through an iterative reward design process, VLM-based feedback guides the refinement of LLM-generated constraints, significantly enhancing the reliability of our method. Central to our approach is Pose2CLIP, a lightweight pose-to-semantic feature mapper that significantly enhances the quality and efficiency of VLM evaluation. Extensive experiments demonstrate GROVE's versatility across diverse tasks and learning paradigms. Our approach achieves 22.2% higher naturalness and 25.7% higher task completion scores while training 8.4 times faster than previous open-vocabulary methods, establishing a new foundation for scalable physical skill acquisition.
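
The sketch below shows one plausible way such a generalized reward could combine a Pose2CLIP-style semantic score with LLM-generated constraint penalties. The encoder stand-ins, constraint format, and weighting are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of a GROVE-style generalized reward (illustrative only).
import numpy as np

def semantic_reward(pose, task_text, pose_encoder, text_encoder):
    """Cosine similarity between a mapped pose feature and the task text embedding."""
    p = pose_encoder(pose)            # Pose2CLIP-like mapper: pose -> CLIP space
    t = text_encoder(task_text)       # CLIP-style text embedding of the instruction
    return float(np.dot(p, t) / (np.linalg.norm(p) * np.linalg.norm(t)))

def generalized_reward(pose, task_text, constraints, pose_encoder, text_encoder,
                       w_sem=1.0, w_con=1.0):
    """Combine a VLM-style semantic score with LLM-generated constraint penalties."""
    r_sem = semantic_reward(pose, task_text, pose_encoder, text_encoder)
    # Each constraint is a callable returning a non-negative violation magnitude.
    r_con = -sum(c(pose) for c in constraints)
    return w_sem * r_sem + w_con * r_con

# Toy usage with random "encoders" and one constraint keeping the root upright.
rng = np.random.default_rng(0)
pose_enc = lambda pose: rng.standard_normal(512)
text_enc = lambda text: rng.standard_normal(512)
upright = lambda pose: abs(pose["root_tilt"])        # violation in radians
pose = {"root_tilt": 0.1}
print(generalized_reward(pose, "perform a cartwheel", [upright], pose_enc, text_enc))
```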

Sat 14 June 11:45 - 12:00 PDT

Award Candidate
Navigation World Models

Amir Bar · Gaoyue Zhou · Danny Tran · Trevor Darrell · Yann LeCun

Navigation is a fundamental skill for agents with visual-motor capabilities. We propose a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations given past observations and navigation actions. NWM is a Conditional Diffusion Transformer (CDiT) trained on video footage of robots as well as unlabeled egocentric video data. We scale the model up to 1B parameters and train it on human and robot agent data from numerous environments and embodiments. Our model scales favorably in both known and unknown environments and can leverage unlabeled egocentric video data. NWM exhibits improved navigation planning skills, either by planning from scratch or by ranking proposals from an external navigation policy. In contrast to existing supervised navigation models, which are "hard coded", NWM can incorporate new constraints when planning trajectories. NWM learns visual priors that enable it to imagine navigation trajectories from just a single input image.
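
To illustrate the proposal-ranking use of a world model mentioned above, the sketch below imagines future observations for several candidate action sequences and picks the plan whose final imagined frame best matches a goal. The world_model and goal_similarity stand-ins are hypothetical placeholders, not the trained CDiT.

```python
# Minimal sketch of planning by ranking candidate action sequences with a world model.
import numpy as np

def rollout(world_model, obs, actions):
    """Autoregressively imagine future observations for one action sequence."""
    frames = [obs]
    for a in actions:
        frames.append(world_model(frames[-1], a))   # predict the next frame
    return frames

def rank_plans(world_model, obs, goal, candidate_plans, goal_similarity):
    """Score each plan by how close its imagined final frame is to the goal."""
    scores = []
    for actions in candidate_plans:
        final = rollout(world_model, obs, actions)[-1]
        scores.append(goal_similarity(final, goal))
    best = int(np.argmax(scores))
    return candidate_plans[best], scores

# Toy usage: "observations" are 2D positions, actions are displacements.
world_model = lambda o, a: o + a
goal_similarity = lambda o, g: -np.linalg.norm(o - g)
obs, goal = np.zeros(2), np.array([1.0, 1.0])
plans = [np.tile([0.2, 0.2], (5, 1)), np.tile([0.3, 0.0], (5, 1))]
best, scores = rank_plans(world_model, obs, goal, plans, goal_similarity)
print(best[0], scores)
```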

Sat 14 June 12:00 - 12:15 PDT

Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning

Mi Luo · Zihui Xue · Alex Dimakis · Kristen Grauman

Egocentric and exocentric perspectives of human action differ significantly, yet overcoming this extreme viewpoint gap is critical for applications in augmented reality and robotics. We propose ViewpointRosetta, an approach that unlocks large-scale unpaired ego and exo video data to learn clip-level viewpoint-invariant video representations. Our framework introduces (1) a diffusion-based Rosetta Stone Translator (RST), which, leveraging a moderate amount of synchronized multi-view videos, serves as a translator in feature space to decipher the alignments between unpaired ego and exo data, and (2) a dual encoder that aligns unpaired data representations through contrastive learning with RST-based synthetic feature augmentation and soft alignment. To evaluate the learned features in a standardized setting, we construct a new cross-view benchmark using Ego-Exo4D, covering cross-view retrieval, action recognition, and skill assessment. Our framework demonstrates superior cross-view understanding compared to previous view-invariant learning and egocentric video representation learning approaches, and opens the door to bringing vast amounts of traditional third-person video to bear on the more nascent first-person setting.
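
As a rough sketch of the contrastive alignment step, the code below computes a symmetric InfoNCE loss between ego clip features and exo features mapped through an RST-like translator (here a random linear layer as a stand-in). It assumes hard one-to-one positives for clarity, whereas the paper uses RST-based synthetic feature augmentation and soft alignment for unpaired data.

```python
# Minimal sketch of contrastive ego-exo feature alignment (illustrative only).
import torch
import torch.nn.functional as F

def info_nce(ego, exo_translated, temperature=0.07):
    """Symmetric InfoNCE between ego clips and translated exo clips."""
    ego = F.normalize(ego, dim=-1)
    exo = F.normalize(exo_translated, dim=-1)
    logits = ego @ exo.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(ego.size(0))         # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: batch of 8 clip-level features, 256-dim.
B, D = 8, 256
ego_feats = torch.randn(B, D)
exo_feats = torch.randn(B, D)
rst = torch.nn.Linear(D, D)                     # stand-in for the diffusion-based RST
loss = info_nce(ego_feats, rst(exo_feats))
print(loss.item())
```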